OpenBLAS ChangeLog
====================================================================
Version 0.3.31
15-Jan-2026

general:
 - reverted a matrix partitioning optimization from 0.3.30 that could lead to
   race conditions and subsequent invalid results in GEMM
 - added the bfloat16 extensions BGEMM and BGEMV
 - added a BLAS interface for the ?GEMM_BATCH extensions
 - added the BLAS extensions ?GEMM_BATCH_STRIDED and their CBLAS interface
 - added the basic infrastructure for half-precision float (FP16) format
   using SH prefix
 - reimplemented the LAPACK SLAED3/DLAED3 function using multithreading, thereby
   improving the performance of the SSYEVD/DSYEVD eigensolver for symmetric matrices
   on all platforms
 - limited the number of retries for initial memory allocation to avoid hanging
   indefinitely on low-memory systems
 - fixed a thread lockup situation encountered with Python 3.9 or older and NumPy
 - introduced a problem size threshold for multithreading in STRMV/DTRMV
 - introduced a problem size threshold for multithreading in CHER/CHER2/CHPR/CHPR2
   and ZHER/ZHER2/ZHPR/ZHPR2
 - improved the problem size thresholds for multithreading in SGER/DGER
 - improved autodetection of the Fortran compiler
 - fixed passing of the INTERFACE64=1 option to the flang-new compiler
 - fixed a potential deadlock in multithreaded code after calling fork()
 - fixed builds using CMake on FreeBSD
 - fixed builds using CMake from within Cygwin on Windows
 - fixed builds using CMake and the NVHPC compiler on ARM64
 - fixed CMake build errors caused by misdetection of compiler or OpenMP versions
 - improved contents of the CMake-generated OpenBLASConfig.cmake file
 - added support for cross-compilation to RISCV targets via CMake
 - fixed cross-compilation to x86 targets from non-x86 architectures
 - fixed failure to install cblas.h if NO_CBLAS=0 was specified
 - fixed missing user-defined pre- and postfixes on functions in lapack.h and lapacke.h
 - included fixes from the Reference-LAPACK project:
   - fix ordering bug in ?LAED/?LASD (Reference-LAPACK PR 1140)
   - revert changes in ?GEEV from PR 1129 (Reference-LAPACK PR 1142)
   - fix workspace allocation in LAPACKE_?TRSEN (Reference-LAPACK PR 1144)
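
As a rough illustration of what a strided-batch GEMM such as the
?GEMM_BATCH_STRIDED extension listed above computes, the plain C sketch below
applies the same GEMM to a sequence of matrices stored at fixed strides. The
function name ref_dgemm_batch_strided, its argument order and the restriction
to non-transposed column-major operands are assumptions made for this example.
They are not the OpenBLAS interface, for which cblas.h should be consulted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not the OpenBLAS API).
 * For every batch index p it computes
 *     C_p = alpha * A_p * B_p + beta * C_p
 * with A_p = a + p*stride_a, B_p = b + p*stride_b, C_p = c + p*stride_c.
 * All matrices are column-major and not transposed:
 * A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc). */
void ref_dgemm_batch_strided(int m, int n, int k, double alpha,
                             const double *a, int lda, ptrdiff_t stride_a,
                             const double *b, int ldb, ptrdiff_t stride_b,
                             double beta,
                             double *c, int ldc, ptrdiff_t stride_c,
                             int batch)
{
    assert(m >= 0 && n >= 0 && k >= 0 && batch >= 0);
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int p = 0; p < batch; p++) {
        const double *ap = a + p * stride_a;
        const double *bp = b + p * stride_b;
        double *cp = c + p * stride_c;

        for (int j = 0; j < n; j++) {
            for (int i = 0; i < m; i++) {
                double sum = 0.0;
                for (int l = 0; l < k; l++)
                    sum += ap[i + (size_t)l * lda] * bp[l + (size_t)j * ldb];

                double *cij = &cp[i + (size_t)j * ldc];
                /* With beta == 0 the old contents of C are not read. */
                *cij = (beta == 0.0) ? alpha * sum : alpha * sum + beta * *cij;
            }
        }
    }
}
```

Storing the batch at fixed strides lets a caller describe many equally sized
products with three base pointers instead of three arrays of pointers.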

riscv:
 - added optimized SBGEMM kernels for ZVL128B and ZVL256B targets
 - added optimized SHGEMM kernels for ZVL128B and ZVL256B targets
 - added optimized SBGEMV and SHGEMV kernels for ZVL128B/ZVL256B
 - improved performance of the GEMV kernel for ZVL256B
 - improved the performance of the CROT and ZROT kernels for ZVL128B and x280
 - improved the detection of RVV1.0 capability
 - improved performance of the matrix packing helper functions for ZVL128B and ZVL256B
 - improved performance of OMATCOPY for ZVL128B and ZVL256B

arm:
 - fixed spurious executable stack in the getarch utility

arm64:
 - fixed spurious executable stack in the getarch utility
 - fixed compiler warnings arising from the timer macro RPCC
 - fixed cache size detection for Qualcomm Oryon under Windows on Arm
 - fixed argument handling in the default SVE kernel for SDOT/DDOT
 - building the BFLOAT16 kernels is now enabled by default
 - improved the overall performance of GEMM, SYMM and HEMM on A64FX
 - improved the performance of SDOT/DDOT on A64FX
 - improved the multithreading performance of SDOT/DDOT on A64FX by
   introducing a throttling table that matches thread count to problem size
 - improved the performance of SGER/DGER on A64FX and NEOVERSEV1
 - improved the multithreading performance of GEMM on A64FX and NEOVERSEV1
 - improved the performance of the GEMV kernel for SVE-capable targets
 - improved the multithreading performance of SGEMM on NEOVERSEV1 and V2
 - added optimized SAXPY/DAXPY SVE kernels for A64FX and NEOVERSEV1
 - added optimized BGEMM and BGEMV kernels for NEOVERSEV1
 - added an optimized BGEMM kernel for NEOVERSEN2
 - added support for the NEOVERSEV2 cpu
 - added dedicated support for the Apple M4 cpu as VORTEXM4
 - added optimized SGEMM/SSYMM/STRMM/SSYRK/SSYR2K for SME-capable targets
   (ARMV9SME and VORTEXM4)
 - improved the precision of the SNRM2 kernel
 - added cpu autodetection and compiler settings for Ampere One processors
 - fixed cpu autodetection for Apple M systems running Linux
 - fixed building on MacOS with AppleClang, gfortran and Xcode v16 or newer
 - fixed several errors in the C code replacements for the complex and double
   precision complex LAPACK functions that get used (only) when compiling with
   Microsoft C and NOFORTRAN=1 under MS Windows
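
The idea of a throttling table that matches thread count to problem size, as
mentioned for SDOT/DDOT above, can be sketched in plain C as follows. The
table layout, the threshold values and the name pick_threads are hypothetical
and chosen only for illustration. They do not reflect the values or code used
in OpenBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table: problems of size up to max_n use `threads` threads. */
struct throttle_entry {
    long max_n;
    int threads;
};

/* Hypothetical thresholds, sorted by ascending problem size. */
static const struct throttle_entry throttle_table[] = {
    { 10000L, 1 },
    { 100000L, 2 },
    { 1000000L, 8 },
};

/* Return the thread count for a problem of size n, never exceeding
 * max_threads. Sizes beyond the last table row use all available threads. */
int pick_threads(long n, int max_threads)
{
    assert(max_threads >= 1);

    size_t entries = sizeof throttle_table / sizeof throttle_table[0];
    for (size_t i = 0; i < entries; i++) {
        if (n <= throttle_table[i].max_n) {
            int t = throttle_table[i].threads;
            return t < max_threads ? t : max_threads;
        }
    }
    return max_threads;
}
```

A lookup of this kind keeps small problems on one or a few threads, where the
cost of starting and synchronizing threads would otherwise outweigh the gain.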

power:
 - added initial support for the POWER11 architecture
 - improved performance of DGEMM and DGEMV on POWER10
 - fixed the default compiler flags to use "-O3" instead of the possibly unsafe
   "-Ofast"
 - fixed building under MacOS (for old G4 Macs) with CMake
 - fixed potential miscompilation of DGEMV and other assembly kernels by gcc15.1
 - fixed compilation with recent versions of flang

loongarch64:
 - fixed warnings and potential inaccuracies arising from incorrect saving of registers
 - fixed enumeration of logical cores on big NUMA servers
 - fixed building with LLVM and the INTERFACE64=1 option

x86:
 - fixed building the GEMM3M kernels for the GENERIC target
 - fixed several errors in the C code replacements for the complex and double
   precision complex LAPACK functions that get used (only) when compiling with
   Microsoft C and NOFORTRAN=1 under MS Windows

x86_64:
 - added cpu autodetection for Intel Lunar Lake (Core Ultra 200V)
 - changed all ?MIN and ?MAX assembly kernels to use unaligned operations
 - fixed several errors in the C code replacements for the complex and double
   precision complex LAPACK functions that get used (only) when compiling with
   Microsoft C and NOFORTRAN=1 under MS Windows
 - fixed potential crashes in builds for Cooper Lake, Sapphire Rapids or Zen5 cpus
   under MS Windows

zarch:
 - added support for building with CMake

sparc:
 - fixed a potential crash in the DNRM2 kernel

====================================================================
Version 0.3.30
19-Jun-2025

general:
 - fixed an installation problem with the thread safety test in gmake builds
 - fixed spurious overwriting of an input array in complex GEMMT/GEMMTR
 - fixed naming of GEMMTR in error messages from XERBLA
 - fixed compilation of SBGEMMT/SBGEMMTR in CMake builds
 - fixed the implementation of ?NRM2 to handle INCX=0 correctly
 - removed tests for CSROT and ZDROT that relied on unspecified behavior
 - fixed a performance regression in multithreaded GEMM that was particularly
   serious on POWER targets
 - fixed linking issues when using LLVM's flang-new with gmake
 - fixed a potential thread safety problem with C11 atomic operations
 - further improved the workload partitioning in parallel GEMM
 - fixed omission of LAPACKE interfaces for CGESVDQ, CTRSYL3 and ?GEQPF in
   CMake builds
 - fixed mishandling of setting NO_LAPACK to FALSE, and incorrect dependencies
   for LAPACK function SPMV in CMake builds
 - added explicit CMake options for building LAPACKE and shared libraries
 - simplified and improved handling of OpenMP options in CMake builds
 - reworked Windows DLL generation in CMake builds to ensure correct symbol
   renaming (pre/postfixing) and optional generation of PDB files for debugging
 - updated the Perl script version of the gensymbol utility for use with
   Windows-on-Arm
 - fixed building with (MinGW) gmake on Windows to ensure completeness of the
   LAPACK included in the static library (potential race condition due to the
   Windows version of the "ln" utility creating snapshot copies rather than links)
 - fixed unwanted deletion of the lapacke_mangling.h file by "make clean"
 - fixed potential duplication of a _64 suffix on library names in CMake builds
 - fixed compilation of the C fallback copies of the LAPACK code with GCC 15
 - included fixes from the Reference-LAPACK project:
   - fixed a truncated error message in the EIG part of the testsuite
     (Reference-LAPACK PR 1119)
   - fixed too strict check in LAPACKE_?gesdd_work (PR #1126)
   - fixed memory corruption when calling ?GEEV with non-finite data (PR #1128)
   - fixed missing initialization of a variable in C/GEQP3RK (PR #1131)
   - fixed 2nd dimension chosen in C/ZUNMLQ transposition operation (PR #1135)

x86_64:
 - fixed an error in the SBGEMV kernel for Cooper Lake/Sapphire Rapids
 - fixed corner cases of NAN and INF input handling in CSCAL and ZSCAL
 - improved the compiler identification code for flang-new
 - fixed a potential build issue in the ZSUM kernel
 - fixed "argument list too long" errors when building on MacOS
 - added cpu autodetection support for several new Arrow Lake models
 - fixed conditional inclusion of the fast path SGEMM kernel in DYNAMIC_ARCH
 - fixed compilation with the MinGW build of GCC 15

arm64:
 - fixed cpu type detection of A64FX and some ThunderX models (broken in 0.3.29)
 - added support for the AmpereOne/1A cpus in DYNAMIC_ARCH builds
 - added an optimized SBGEMM kernel for NEOVERSEV1
 - improved 1xN SBGEMM performance by forwarding to SBGEMV
 - introduced a stepwise increase of the thread count used for
   SGEMM and SGEMV on NEOVERSEV1/V2 in relation to problem size
 - introduced a stepwise increase of the thread count used for
   DGEMV on NEOVERSEV1 in relation to problem size
 - introduced a stepwise increase of the thread count used for
   SDOT and DDOT on NEOVERSEV1 in relation to problem size
 - worked around assembler limitations in LLVM for Windows-on-Arm
 - enabled cpu type autodetection from the registry on Windows-on-Arm
 - improved multithreading threshold for GEMV and GESV on Windows-on-Arm
 - fixed overoptimization issues with LLVM's flang in Windows-on-Arm
 - fixed corner cases of NAN and INF input handling in CSCAL and ZSCAL
 - added a fast path SGEMM kernel for small workloads on SME capable targets
 - improved performance of SGEMM and DGEMM kernels for small workloads
 - improved performance of SGEMV and DGEMV on SVE-capable targets
 - improved performance of SGEMV on NEOVERSEN1 and Apple M
 - added optimized SSYMV and DSYMV kernels for NEOVERSEN1, Apple M and all
   SVE capable targets
 - added optimized SBGEMV kernels for NEOVERSEV1/V2/N2
 - improved performance of SGEMM through faster NCOPY kernels
 - added compiler options for the NVIDIA HPC Compiler Suite
 - fixed compilation on OSX with Xcode 16.3 and later
 - fixed cpu core type and cache size detection on Apple M4
 - updated GEMM parameter settings for Neoverse cpus in cross-builds with CMake
 - fixed default compiler options for NEOVERSEN1 and CORTEXX2 in CMake builds
 - fixed conditional inclusion of the fast path SGEMM kernel in DYNAMIC_ARCH
 - fixed potential miscompilation of the non-SVE SDOT kernel

riscv64:
 - added optimized SROTM and DROTM kernels for x280
 - fixed corner cases of NAN and INF input handling in CSCAL and ZSCAL
 - improved performance of GEMM_TCOPY on RVV1.0 targets with
   VLEN of 128 or 256
 - improved performance of OMATCOPY on targets with VLEN 256
 - greatly improved performance of SGEMV/DGEMV
 - improved performance of CGEMV and ZGEMV on C910V and all RVV targets
   with VLEN 256
 - improved performance of SAXPBY and DAXPBY on C910V and all RVV targets
   with VLEN 256
 - improved performance of AXPY and DOT on C910V and ZVL256B targets by
   falling back to non-vectorized code for very small N (thereby fixing
   poor performance of CHBMV/ZHBMV for very small K)
 - fixed CMake build failures of the TRMM kernels

loongarch64:
 - improved performance of the LSX versions of SSYMV/DSYMV
 - made the LASX versions of the DSYMV and SSYMV kernels
   compatible with hardware changes in LA664 and future targets
 - fixed inaccuracies in several LASX kernels
 - improved compatibility of LSX kernels with LA264 targets
 - fixed handling of deprecated target names in CMake builds
 - fixed corner cases of NAN and INF input handling in CSCAL and ZSCAL

power:
 - fixed building for PPCG4 with CMake
 - fixed SSCAL/DSCAL on PPC970 running FreeBSD
 - fixed a potential alignment issue in the POWER8 SGEMV kernel
 - fixed corner cases of NAN and INF input handling in CSCAL and ZSCAL

zarch:
 - fixed corner cases of NAN and INF input handling in CSCAL and ZSCAL
 - fixed unwanted generation of object files with a writable stack

x86:
 - fixed corner cases of NAN and INF input handling in CSCAL and ZSCAL
 - worked around potential miscompilation of CDOT with very old binutils

arm:
 - fixed corner cases of NAN and INF input handling in CSCAL and ZSCAL
 - fixed unwanted generation of object files with a writable stack

sparc:
 - fixed corner cases of NAN and INF input handling in CSCAL and ZSCAL

alpha:
 - fixed build failure caused by spurious Windows-only typecasts

cell:
 - fixed probable build issue caused by spurious Windows-only typecasts

====================================================================
Version 0.3.29
12-Jan-2025

general:
 - fixed a potential NULL pointer dereference in multithreaded builds
 - added function aliases for GEMMT using its new name GEMMTR adopted by Reference-BLAS
 - fixed a build failure when building without LAPACK_DEPRECATED functions
 - the minimum required CMake version for CMake-based builds was raised to 3.16.0 in order
   to remove many compatibility and deprecation warnings
 - added more detailed CMake rules for OpenMP builds (mainly to support recent LLVM)
 - fixed the behavior of the recently added CBLAS_?GEMMT functions with row-major data
 - improved thread scaling of multithreaded SBGEMV
 - improved thread scaling of multithreaded TRTRI
 - fixed compilation of the CBLAS testsuite with gcc14 (and no Fortran compiler)
 - added support for option handling changes in flang-new from LLVM18 onwards
 - added support for recent calling conventions changes in Cray and NVIDIA compilers
 - added support for compilation with the NAG Fortran compiler
 - fixed placement of the -fopenmp flag and libsuffix in the generated pkgconfig file
 - improved the CMakeConfig file generated by the Makefile build
 - fixed const-correctness of cblas_?geadd in cblas.h
 - fixed a potential inaccuracy in multithreaded BLAS3 calls
 - fixed empty implementations of get/set_affinity that print a warning in OpenMP builds
 - fixed function signatures for TRTRS in the converted C version of LAPACK
 - fixed omission of several single-precision LAPACK symbols in the shared library
 - improved build instructions for the provided "pybench" benchmarks
 - improved documentation, including added build instructions for WoA and HarmonyOS
   as well as descriptions of environment variables that affect build and runtime behavior
 - added a separate "make install_tests" target for use with cross-compilations
 - integrated improvements and corrections from Reference-LAPACK:
   - removed a comparison in LAPACKE ?tpmqrt that is always false (LAPACK PR 1062)
   - fixed the leading dimension for B in tests for GGEV (LAPACK PR 1064)
   - replaced the ?LARFT functions with a recursive implementation (LAPACK PR 1080)

arm:
 - fixed build with recent versions of the NDK (missing .type declaration of symbols)

arm64:
 - fixed a long-standing bug in the (generic) c/zgemm_beta kernel that could lead to
   reads and writes outside the array bounds in some circumstances
 - rewrote cpu autodetection to scan all cores and return the highest performing type
 - improved the DGEMM performance for SVE targets and small matrix sizes
 - improved dimension criteria for forwarding from GEMM to GEMV kernels
 - added SVE kernels for ROT and SWAP
 - improved SVE kernels for SGEMV and DGEMV on A64FX and NEOVERSEV1
 - added support for using the "small matrix" kernels with CMake as well
 - fixed compilation on Windows on Arm
 - improved compile-time detection of SVE capability
 - added cpu autodetection and initial support for Apple M4
 - added support for compilation on systems running iOS
 - added support for compilation on NetBSD ("evbarm" architecture)
 - fixed NRM2 implementations for generic SVE targets and the Neoverse N2
 - fixed compilation for SVE-capable targets with the NVIDIA compiler

x86_64:
 - fixed a wrong storage size in the SBGEMV kernel for Cooper Lake
 - added cpu autodetection for Intel Granite Rapids
 - added cpu autodetection for AMD Ryzen 5 series
 - added optimized SOMATCOPY_CT for AVX-capable targets
 - fixed the fallback implementation of GEMM3M in GENERIC builds
 - tentatively re-enabled builds with the EXPRECISION option
 - worked around a miscompilation of tests with mingw32-gfortran14
 - added support for compilation with the Intel oneAPI 2025.0 compiler on Windows

power:
 - fixed multithreaded SBGEMM
 - fixed a CMake build problem on POWER10
 - improved the performance of SGEMV
 - added vectorized implementations of SBGEMV and support for forwarding 1xN SBGEMM to them
 - fixed illegal instructions and potential memory overflow in SGEMM on PPCG4
 - fixed handling of NaN and Inf arguments in SSCAL and DSCAL on PPC440, G4 and 970
 - added improved CGEMM and ZGEMM kernels for POWER10
 - added Makefile logic to remove all optimization flags in DEBUG builds

mips64:
 - fixed compilation with gcc14
 - fixed GEMM parameter selection for the MIPS64_GENERIC target
 - fixed a potential build failure when compiling with OpenMP

loongarch64:
 - fixed compilation for Loongson3 with recent versions of gmake
 - fixed a potential loss of precision in Loongson3A GEMM
 - fixed a potential build failure when compiling with OpenMP
 - added optimized SOMATCOPY for LASX-capable targets
 - introduced a new cpu naming scheme while retaining compatibility
 - added support for cross-compiling Loongarch64 targets with CMake
 - added support for compilation with LLVM

riscv64:
 - removed thread yielding overhead caused by sched_yield
 - replaced some non-standard intrinsics with their official names
 - fixed and sped up the implementations of CGEMM/ZGEMM TCOPY for vector lengths 128 and 256
 - improved the performance of SNRM2/DNRM2 for RVV1.0 targets
 - added optimized ?OMATCOPY_CN kernels for RVV1.0 targets

====================================================================
Version 0.3.28
 8-Aug-2024

general:
- Reworked the unfinished implementation of HUGETLB from GotoBLAS
  for allocating huge memory pages as buffers on suitable systems
- Changed the unfinished implementation of GEMM3M for the generic
  target on all architectures to at least forward to regular GEMM
- Improved multithreaded GEMM performance for large non-skinny matrices
- Improved BLAS3 performance on larger multicore systems through improved
  parallelism
- Improved performance of the initial memory allocation by reducing
  locking overhead
- Improved performance of GBMV at small problem sizes by introducing
  a size barrier for the switch to multithreading
- Added an implementation of the CBLAS_GEMM_BATCH extension
- Fixed miscompilation of CAXPYC and ZAXPYC on all architectures in
  CMAKE builds (error introduced in 0.3.27)
- Fixed corner cases involving the handling of NAN and INFINITY
  arguments in ?SCAL on all architectures
- Added support for cross-compiling to WebAssembly (WASM) with CMAKE (in addition
  to the already present makefile support)
- Fixed NAN handling and potential accuracy issues in compilations with
  Intel ICX by supplying a suitable fp-model option by default
- The contents of the github project wiki have been converted into
  a new set of documentation included with the source code.
- It is now possible to register a callback function that replaces
  the built-in support for multithreading with an external backend
  like TBB (openblas_set_threads_callback_function)
- Fixed potential duplication of suffixes in shared library naming
- Improved C compiler detection by the build system to tolerate more
  naming variants for gcc builds
- Fixed an unnecessary dependency of the utest on CBLAS
- Fixed spurious error reports from the BLAS extensions utest
- Fixed unwanted invocation of the GEMM3M tests in cross-compilation
- Fixed a flaw in the makefile build that could lead to the pkgconfig
  file containing an entry of UNKNOWN for the target cpu after installing
- Integrated fixes from the Reference-LAPACK project:
  - Fixed uninitialized variables in the LAPACK tests for ?QP3RK (PR 961)
  - Fixed potential bounds error in ?UNHR_COL/?ORHR_COL (PR 1018)
  - Fixed potential infinite loop in the LAPACK testsuite (PR 1024)
  - Make the variable type used for hidden length arguments configurable (PR 1025)
  - Fixed SYTRD workspace computation and various typos (PR 1030)
  - Prevent compiler use of FMA that could increase numerical error in ?GEEVX (PR 1033)
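
The ?SCAL corner cases with NAN and INFINITY arguments mentioned above can be
illustrated with a small plain C sketch. It contrasts a scaling loop that
shortcuts alpha == 0 by writing zeros with one that always multiplies and so
follows IEEE 754 arithmetic, where 0 * NaN and 0 * Inf both give NaN. The
function names are hypothetical, and the sketch makes no claim about which
behavior a particular OpenBLAS version or kernel implements.

```c
#include <assert.h>
#include <math.h>

/* Variant A: treats alpha == 0 as "set x to zero", which hides any
 * NaN or Inf already present in x. */
void scal_zero_shortcut(int n, double alpha, double *x, int incx)
{
    assert(n >= 0 && incx > 0);
    for (int i = 0; i < n; i++) {
        double *xi = &x[(long)i * incx];
        *xi = (alpha == 0.0) ? 0.0 : alpha * *xi;
    }
}

/* Variant B: always multiplies, so IEEE 754 rules apply and
 * 0 * NaN as well as 0 * Inf yield NaN. */
void scal_propagate(int n, double alpha, double *x, int incx)
{
    assert(n >= 0 && incx > 0);
    for (int i = 0; i < n; i++)
        x[(long)i * incx] *= alpha;
}

/* Helper: number of NaN or Inf entries in x. */
int count_nonfinite(int n, const double *x, int incx)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (!isfinite(x[(long)i * incx]))
            count++;
    return count;
}
```

Scaling the vector {1.0, NAN, INFINITY} by zero leaves no non-finite entries
with variant A, while variant B returns NaN in the second and third position.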

x86-64:
- reverted thread management under Windows to its state before 0.3.26
  due to signs of race conditions in some circumstances now under study
- fixed accidental selection of the unoptimized generic SBGEMM kernel
  in CMAKE builds for CooperLake and SapphireRapids targets
- fixed a potential thread buffer overrun in SBSTOBF16 on small systems
- fixed an accuracy issue in ZSCAL introduced in 0.3.26
- fixed compilation with CMAKE and recent releases of LLVM
- added support for Intel Emerald Rapids and Meteor Lake cpus
- added autodetection support for the Zhaoxin KX-7000 cpu
- fixed autodetection of Intel Prescott (probably broken since 0.3.19)
- fixed compilation for older targets with the Yocto SDK
- fixed compilation of the converter-generated C versions
  of the LAPACK sources with gcc-14
- improved compiler options when building with CMAKE and LLVM for
  AVX512-capable targets
- added support for supplying the L2 cache size via an environment
  variable (OPENBLAS_L2_SIZE) in case it is not correctly reported
  (as in some VM configurations)
- improved the error message shown when thread creation fails on startup
- fixed setting the rpath entry of the dylib in CMAKE builds on MacOS

arm:
- fixed building for baremetal targets with make

arm64:
- Added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1
  matrix to the corresponding GEMV kernel
- added optimized SGEMV and DGEMV kernels for A64FX
- added optimized SVE kernels for small-matrix GEMM
- added A64FX to the cpu list for DYNAMIC_ARCH
- fixed building with support for cpu affinity
- worked around accuracy problems with C/ZNRM2 on NeoverseN1 and
  Apple M targets
- improved GEMM performance on Neoverse V1
- fixed compilation for NEOVERSEN2 with older compilers
- fixed potential miscompilation of the SVE SDOT and DDOT kernels
- fixed potential miscompilation of the non-SVE CDOT and ZDOT kernels
- fixed a potential overflow when using very large user-defined BUFFERSIZE
- fixed setting the rpath entry of the dylib in CMAKE builds on MacOS
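
The fast path that forwards 1xN or Mx1 GEMM calls to GEMV rests on the fact
that such a product is a matrix-vector product. The plain C sketch below shows
one way to express that dispatch for non-transposed, column-major operands.
The names ref_dgemv and ref_dgemm and their simplified argument lists are
assumptions made for this example and are not the OpenBLAS kernels or
interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* y = alpha*op(A)*x + beta*y for a column-major rows x cols matrix A,
 * with op(A) = A if trans == 0 and op(A) = A^T otherwise. */
void ref_dgemv(int trans, int rows, int cols, double alpha,
               const double *a, int lda, const double *x, int incx,
               double beta, double *y, int incy)
{
    assert(rows >= 0 && cols >= 0 && incx > 0 && incy > 0);
    int leny = trans ? cols : rows;
    int lenx = trans ? rows : cols;

    for (int i = 0; i < leny; i++) {
        double sum = 0.0;
        for (int l = 0; l < lenx; l++) {
            double aval = trans ? a[l + (size_t)i * lda]
                                : a[i + (size_t)l * lda];
            sum += aval * x[(size_t)l * incx];
        }
        double *yi = &y[(size_t)i * incy];
        *yi = (beta == 0.0) ? alpha * sum : alpha * sum + beta * *yi;
    }
}

/* C = alpha*A*B + beta*C with A m x k, B k x n, C m x n, no transposes. */
void ref_dgemm(int m, int n, int k, double alpha,
               const double *a, int lda, const double *b, int ldb,
               double beta, double *c, int ldc)
{
    if (n == 1) {
        /* Mx1 result: the single column of C is alpha*A*b + beta*c. */
        ref_dgemv(0, m, k, alpha, a, lda, b, 1, beta, c, 1);
        return;
    }
    if (m == 1) {
        /* 1xN result: C^T = alpha*B^T*A^T + beta*C^T. The single row of A
         * is read with stride lda, the single row of C with stride ldc. */
        ref_dgemv(1, k, n, alpha, b, ldb, a, lda, beta, c, ldc);
        return;
    }
    /* General case: plain triple loop. */
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++)
                sum += a[i + (size_t)l * lda] * b[l + (size_t)j * ldb];
            double *cij = &c[i + (size_t)j * ldc];
            *cij = (beta == 0.0) ? alpha * sum : alpha * sum + beta * *cij;
        }
    }
}
```

In an optimized library the benefit comes from the GEMV kernel avoiding the
packing and blocking work that a full GEMM kernel performs.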

power:
- Added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1
  matrix to the corresponding GEMV kernel
- significantly improved performance of SBGEMM on POWER10
- fixed compilation with OpenMP and the XLF compiler
- fixed building of the BLAS extension utests under AIX
- fixed building of parts of the LAPACK testsuite with XLF
- fixed CSWAP/ZSWAP on big-endian POWER10 targets
- fixed a performance regression in SAXPY on POWER10 with OpenXL
- fixed accuracy issues in CSCAL/ZSCAL when compiled with LLVM
- fixed building for POWER9 under FreeBSD
- fixed a potential overflow when using very large user-defined BUFFERSIZE
- fixed an accuracy issue in the POWER6 kernels for GEMM and GEMV

riscv64:
- Added a fast path forwarding SGEMM and DGEMM calls with a 1xN or Mx1
  matrix to the corresponding GEMV kernel
- fixed building for RISCV64_GENERIC with OpenMP enabled
- added DYNAMIC_ARCH support (comprising GENERIC_RISCV64 and the two
  RVV 1.0 targets with vector length of 128 and 256)
- worked around the ZVL128B kernels for AXPBY mishandling the special
  case of zero Y increment

loongarch64:
- improved GEMM performance on servers of the 3C5000 generation
- improved performance and stability of DGEMM
- improved GEMV and TRSM kernels for LSX and LASX vector ABIs
- fixed CMAKE compilation with the INTERFACE64 option set
- fixed compilation with CMAKE
- worked around spurious errors flagged by the BLAS3 tests
- worked around a miscompilation of the POTRS utest by gcc 14.1

mips64:
- fixed ASUM and SUM kernels to accept negative step sizes in X
- fixed complex GEMV kernels for MSA

====================================================================
Version 0.3.27
 4-Apr-2024

general:
- added initial (generic) support for the CSKY architecture
- capped the maximum number of threads used in GEMM, GETRF and POTRF to avoid creating
  underutilized or idle threads
- sped up multithreaded POTRF on all platforms
- added extension openblas_set_num_threads_local() that returns the previous thread count
- re-evaluated the SGEMV and DGEMV load thresholds to avoid activating multithreading
  for too small workloads
- improved the fallback code used when the precompiled number of threads is exceeded,
  and made it callable multiple times during the lifetime of an instance
- added CBLAS interfaces for the BLAS extensions ?AMIN, ?AMAX, CAXPYC and ZAXPYC
- fixed a potential buffer overflow in the interface to the GEMMT kernels
- fixed use of incompatible pointer types in GEMMT and C/ZAXPBY as flagged by GCC-14
- fixed unwanted case sensitivity of the character parameters in ?TRTRS
- sped up the OpenMP thread management code
- fixed sizing of logical variables in INTERFACE64 builds of the C version of LAPACK
- fixed inclusion of new LAPACK and LAPACKE functions from LAPACK 3.11 in the shared library
- added a testsuite for the BLAS extensions
- modified the error thresholds for SGS/DGS functions in the LAPACK testsuite to suppress
  spurious errors
- added support for building the benchmark collection with CMAKE
- added rewriting of linker options to avoid linking both libgomp and libomp in CMAKE builds
  with OpenMP enabled that use clang with gfortran
- fixed building on systems with uClibc
- added support for calling ?NRM2 with a negative increment value on all architectures
- added support for the LLVM18 version of the flang-new compiler
- fixed handling of the OPENBLAS_LOOPS variable in several benchmarks
- Integrated fixes from the Reference-LAPACK project:
  - Increased accuracy in C/ZLARFGP (Reference-LAPACK PR 981)

x86:
- fixed handling of NaN and Inf arguments in ZSCAL
- fixed GEMM3M functions failing in CMAKE builds

x86-64:
- removed all instances of sched_yield() on Linux and BSD
- fixed a potential deadlock in the thread server on MSWindows (introduced in 0.3.26)
- fixed GEMM3M functions failing in CMAKE builds
- fixed handling of NaN and Inf arguments in ZSCAL
- added compiler checks for AVX512BF16 compatibility
- fixed LLVM compiler options for Sapphire Rapids
- fixed cpu handling fallbacks for Sapphire Rapids with
  disabled AVX2 in DYNAMIC_ARCH mode
- fixed extensions SCSUM and DZSUM
- improved GEMM performance for ZEN targets

arm:
- fixed handling of NaN and Inf arguments in ZSCAL

arm64:
- added initial support for the Cortex-A76 cpu
- fixed handling of NaN and Inf arguments in ZSCAL
- fixed default compiler options for gcc (-march and -mtune)
- added support for ArmCompilerForLinux
- added support for the NeoverseV2 cpu in DYNAMIC_ARCH builds
- fixed mishandling of the INTERFACE64 option in CMAKE builds
- corrected SCSUM kernels (erroneously duplicating SCASUM behaviour)
- added SVE-enabled kernels for CSUM/ZSUM
- worked around an inaccuracy in the NRM2 kernels for NeoverseN1 and Apple M

power:
- improved performance of SGEMM on POWER8/9/10
- improved performance of DGEMM on POWER10
- added support for OpenMP builds with xlc/xlf on AIX
- improved cpu autodetection for DYNAMIC_ARCH builds on older AIX
- fixed cpu core counting on AIX
- added support for building a shared library on AIX

riscv64:
- added support for the X280 cpu
- added support for semi-generic RISCV models with vector length 128 or 256
- added support for compiling with either RVV 0.7.1 or RVV 1.0 standard compilers
- fixed handling of NaN and Inf arguments in ZSCAL
- improved cpu model autodetection
- fixed corner cases in ?AXPBY for C910V
- fixed handling of zero increments in ?AXPY kernels for C910V

loongarch64:
- added optimized kernels for ?AMIN and ?AMAX
- fixed handling of NaN and Inf arguments in ZSCAL
- fixed handling of corner cases in ?AXPBY
- fixed computation of SAMIN and DAMIN in LSX mode
- fixed computation of ?ROT
- added optimized SSYMV and DSYMV kernels for LSX and LASX mode
- added optimized CGEMM and ZGEMM kernels for LSX and LASX mode
- added optimized CGEMV and ZGEMV kernels

mips:
- fixed utilization of MSA on P5600 and related cpus (broken in 0.3.22)
- fixed handling of NaN and Inf arguments in ZSCAL
- fixed mishandling of the INTERFACE64 option in CMAKE builds

zarch:
- fixed handling of NaN and Inf arguments in ZSCAL
- fixed calculation of ?SUM on Z13

====================================================================
Version 0.3.26
 2-Jan-2024

general:
- improved the version of openblas.pc that is created by the CMAKE build
- fixed a CMAKE-specific build problem on older versions of MacOS
- worked around linking problems on old versions of MacOS
- corrected installation location of the lapacke_mangling header in CMAKE builds
- added type declarations for complex variables to the MSVC-specific parts of the LAPACK header
- significantly sped up ?GESV for small problem sizes by introducing a lower bound for multithreading
- imported additions and corrections from the Reference-LAPACK project:
  - added new LAPACK functions for truncated QR with pivoting (Reference-LAPACK PRs 891&941)
  - handle miscalculation of minimum work array size in corner cases (Reference-LAPACK PR 942)
  - fixed use of uninitialized variables in ?GEDMD and improved inline documentation (PR 959)
  - fixed use of uninitialized variables (and consequential failures) in ?BBCSD (PR 967)
  - added tests for the recently introduced Dynamic Mode Decomposition functions (PR 736)
  - fixed several memory leaks in the LAPACK testsuite (PR 953)
  - fixed counting of testsuite results by the Python script (PR 954)

x86-64:
- fixed computation of CASUM on SkylakeX and newer targets in the special
  case that AVX512 is not supported by the compiler or operating environment
- fixed potential undefined behaviour in the CASUM/ZASUM kernels for AVX512 targets
- worked around a problem in the pre-AVX kernels for GEMV
- sped up the thread management code on MS Windows

arm64:
- fixed building of the LAPACK testsuite with Xcode 15 on Apple M1 and newer
- sped up the thread management code on MS Windows
- sped up SGEMM and DGEMM on Neoverse V1 and N1
- sped up ?DOT on SVE-capable targets
- reduced the number of targets in DYNAMIC_ARCH builds by eliminating functionally equivalent ones
- included support for Apple M1 and newer targets in DYNAMIC_ARCH builds

power:
- improved the SGEMM kernel for POWER10
- fixed compilation with (very) old versions of gcc
- fixed detection of old 32bit PPC targets in CMAKE-based builds
- added autodetection of the POWERPC 7400 subtype
- fixed CMAKE-based compilation for PPCG4 and PPC970 targets

loongarch64:
- added and improved optimized kernels for almost all BLAS functions

====================================================================
Version 0.3.25
 12-Nov-2023

general:
- improved the error message shown on exceeding the maximum thread count
- improved the code to add supplementary thread buffers in case of overflow
- fixed a potential division by zero in ?ROTG
- improved the ?MATCOPY functions to accept zero-sized rows or columns
- corrected empty prototypes in function declarations
- cleaned up unused declarations in the f2c-converted versions of the LAPACK sources
- fixed compilation with the Cray CCE Compiler suite
- improved link line rewriting to avoid mixed libgomp/libomp builds with clang&gfortran
- worked around OPENMP builds with LLVM14's libomp hanging on FreeBSD
- improved the Makefiles to require less option duplication on "make install"
- imported the following changes from the upcoming release 3.12 of Reference-LAPACK:
  - deprecate utility functions ?GELQS and ?GEQRS (LAPACK PR 900)
  - apply rounding up to workspace calculations done in floating point (LAPACK PR 904)
  - avoid overflow in STGEX2/DTGEX2 (LAPACK PR 907)
  - fix accumulation in ?LASSQ (LAPACK PR 909)
  - fix handling of NaN values in ?GECON (LAPACK PR 926)
  - avoid overflow in CBDSQR/ZBDSQR (LAPACK PR 927)
  - fix poor vector orthogonalizations in ?ORBDB5/?UNBDB5 (LAPACK PR 928 & 930)

x86-64:
- fixed compile-time autodetection of AMD Ryzen3 and Ryzen4 cpus
- fixed capability-based fallback selection for unknown cpus in DYNAMIC_ARCH
- added AVX512 optimizations for ?ASUM on Sapphire Rapids and Cooper Lake

ARM64:
- fixed building on Apple with homebrew gcc
- fixed building with XCODE 15
- fixed building on A64FX and Cortex A710/X1/X2
- increased the default buffer size for recent ARM server cpus

POWER:
- fixed building with the IBM xlf 16.1.1 compiler
- fixed building with IBM XL C
- added support for DYNAMIC_ARCH builds with clang
- fixed union declaration in the BFLOAT16 test case
- enabled optimizations for the AIX assembler on POWER10

LOONGARCH64:
- added an optimized SGEMV kernel
- added an optimized DTRSM kernel

====================================================================
Version 0.3.24
 03-Sep-2023

general:
   - declared the arguments of cblas_xerbla as const (in accordance with the reference implementation
     and others; the previous discrepancy appears to date back to GotoBLAS)
   - fixed the implementation of ?GEMMT that was added in 0.3.23
   - made cpu-specific SWITCH_RATIO parameters for GEMM available to DYNAMIC_ARCH builds
   - fixed application of SYMBOLSUFFIX in CMAKE builds
   - fixed missing SSYCONVF function in the shared library
   - fixed parallel build logic used with gmake
   - added support for compilation with LLVM17, in particular its new Fortran compiler
   - added support for CMAKE builds using the NVIDIA HPC compiler
   - fixed INTERFACE64 builds with CMAKE and the f95 Fortran compiler
   - fixed cross-build detection and management in c_check
   - disabled building of the tests with CMAKE when ONLY_CBLAS is defined
   - fixed several issues with the handling of runtime limits on the number of OPENMP threads
   - corrected the error code returned by SGEADD/DGEADD when LDA is too small
   - corrected the error code returned by IMATCOPY when LDB is too small
   - updated ?NRM2 to support negative increment values (as introduced in release 3.10
     of the reference BLAS)
   - fixed OpenMP builds with CLANG for the case where libomp is not in a standard location
   - fixed a potential overwrite of unrelated memory during thread initialisation on startup
   - fixed a potential integer overflow in the multithreading threshold for ?SYMM/?SYRK
   - fixed build of the LAPACKE interfaces for the LAPACK 3.11.0 ?TRSYL functions added in 0.3.22
   - fixed installation of .cmake files in concurrent 32 and 64bit builds with CMAKE
   - applied additions and corrections from the development branch of Reference-LAPACK:
     - fixed actual arguments passed to a number of LAPACK functions (from Reference-LAPACK PR 885)
     - fixed workspace query results in LAPACK ?SYTRF/?TREVC3 (from Reference-LAPACK PR 883)
     - fixed derivation of the UPLO parameter in LAPACKE_?larfb (from Reference-LAPACK PR 878)
     - fixed a crash in LAPACK ?GELSDD on NRHS=0 (from Reference-LAPACK PR 876)
     - added new LAPACK utility functions CRSCL and ZRSCL (from Reference-LAPACK PR 839)
     - corrected the order of eigenvalues for 2x2 matrices in ?STEMR (Reference-LAPACK PR 867)
     - removed spurious reference to OpenMP variables outside OpenMP contexts (Reference-LAPACK PR 860)
     - updated file comments on use of LAMBDA variable in LAPACK (Reference-LAPACK PR 852)
     - fixed documentation of LAPACK SLASD0/DLASD0 (Reference-LAPACK PR 855)
     - fixed confusing use of "minor" in LAPACK documentation (Reference-LAPACK PR 849)
     - added new LAPACK functions ?GEDMD for dynamic mode decomposition (Reference-LAPACK PR 736)
     - fixed potential stack overflows in the EIG part of the LAPACK testsuite (Reference-LAPACK PR 854)
     - applied small improvements to the variants of Cholesky and QR functions (Reference-LAPACK PR 847)
     - removed unused variables from LAPACK ?BDSQR (Reference-LAPACK PR 832)
     - fixed a potential crash on allocation failure in LAPACKE SGEESX/DGEESX (Reference-LAPACK PR 836)
     - added a quick return from SLARUV/DLARUV for N < 1 (Reference-LAPACK PR 837)
     - updated function descriptions in LAPACK ?GEGS/?GEGV (Reference-LAPACK PR 831)
     - improved algorithm description in ?GELSY (Reference-LAPACK PR 833)
     - fixed scaling in LAPACK STGSNA/DTGSNA (Reference-LAPACK PR 830)
     - fixed crash in LAPACKE_?geqrt with row-major data (Reference-LAPACK PR 768)
     - added LAPACKE interfaces for C/ZUNHR_COL and S/DORHR_COL (Reference-LAPACK PR 827)
     - added error exit tests for SYSV/SYTD2/GEHD2 to the testsuite (Reference-LAPACK PR 795)
     - fixed typos in LAPACK source and comments (Reference-LAPACK PRs 809,811,812,814,820)
     - adopted refactored ?GEBAL implementation (Reference-LAPACK PR 808)

x86_64:
   - added cpu model autodetection for Intel Alder Lake N
   - added activation of the AMX tile to the Sapphire Rapids SBGEMM kernel
   - worked around miscompilations of GEMV/SYMV kernels by gcc's tree-vectorizer
   - fixed compilation of Cooperlake and Sapphire Rapids kernels with CLANG
   - fixed runtime detection of Cooperlake and Sapphire Rapids in DYNAMIC_ARCH
   - fixed feature-based cputype fallback in DYNAMIC_ARCH
   - added support for building the AVX512 kernels with the NVIDIA HPC compiler
   - corrected ZAXPY result on old pre-AVX hardware for the INCX=0 case
   - fixed a potential use of uninitialized variables in ZTRSM

ARM64:
   - added cpu model autodetection for Apple M2
   - fixed wrong results of CGEMM/CTRMM/DNRM2 under OSX (use of reserved register)
   - added support for building the SVE kernels with the NVIDIA HPC compiler
   - added support for building the SVE kernels with the Apple Clang compiler
   - fixed compiler option handling for building the SVE kernels with LLVM
   - implemented SWITCH_RATIO parameter for improved GEMM performance on Neoverse
   - activated SVE SGEMM and DGEMM kernels for Neoverse V1
   - improved performance of the SVE CGEMM and ZGEMM kernels on Neoverse V1
   - improved kernel selection for the ARMV8SVE target and added it to DYNAMIC_ARCH
   - fixed runtime check for SVE availability in DYNAMIC_ARCH builds to take OS or
     container restrictions into account
   - fixed a potential use of uninitialized variables in ZTRSM
   - fixed a potential misdetection of ARMV8 hardware as 32bit in CMAKE builds

LOONGARCH64:
   - added ABI detection
   - added support for cpu affinity handling
   - fixed compilation with early versions of the Loongson toolchain
   - added an optimized SGEMM kernel for 3A5000
   - added optimized DGEMV kernels for 3A5000
   - improved the performance of the DGEMM kernel for 3A5000

MIPS64:
   - fixed miscompilation of TRMM kernels for the MIPS64_GENERIC target

POWER:
   - fixed compiler warnings in the POWER10 SBGEMM kernel

RISCV:
   - fixed application of the INTERFACE64 option when building with CMAKE
   - fixed a potential misdetection of RISCV hardware as 32bit in CMAKE builds
   - fixed IDAMAX and DOT kernels for C910V
   - fixed corner cases in the ROT and SWAP kernels for C910V
   - fixed compilation of the C910V target with recent vendor compilers

====================================================================
Version 0.3.23
 01-Apr-2023

general:
   - fixed a serious regression in GETRF/GETF2 and ZGETRF/ZGETF2 where
     subnormal but nonzero data elements triggered the singularity flag
   - fixed a long-standing bug in CSPR/ZSPR in single-threaded operation
     for cases where elements of the X vector are real numbers (or
     complex with only the real part zero)
   - fixed gmake builds with the option NO_LAPACK
   - fixed a few instances in the gmake Makefiles where expressly
     setting NO_LAPACK=0 or NO_LAPACKE=0 would have the opposite effect

x86_64:
   - added further CPUID values for Intel Raptor Lake

====================================================================
Version 0.3.22
 26-Mar-2023

general:
 - Updated the included LAPACK to Reference-LAPACK release 3.11.0
   plus post-release corrections and improvements
 - Added initial support for processing with the EMSCRIPTEN javascript
   converter (yielding a single-threaded build only)
 - Added a threshold for multithreading in SYMM, SYMV and SYR2K
 - Increased the threshold for multithreading in SYRK
 - OpenBLAS no longer decreases the global OMP_NUM_THREADS when it
   exceeds the maximum thread count the library was compiled for.
 - fixed ?GETF2 potentially returning NaN with tiny matrix elements
 - fixed openblas_set_num_threads to work in USE_OPENMP builds
 - fixed cpu core counting in USE_OPENMP builds returning the number
   of OMP "places" rather than cores
 - fixed interpretation of USE_PERL=0 in build scripts
 - fixed linking of the library with libm in CMAKE builds
 - fixed startup delays resulting from a wrong default setting of
   NO_WARMUP in CMAKE builds
 - fixed inconsistent defaults for overriding of LAPACK SPMV, SPR,
   SYMV, SYR functions in gmake and CMAKE builds
 - fixed stride calculation in the optimized small-matrix path of
   complex SYR
 - fixed compilation of ReLAPACK with CMAKE
 - fixed pkgconfig file contents for INTERFACE64 builds
 - fixed building of Reference-LAPACK with recent gfortran
 - fixed building with only a subset of precision types on Windows
 - added new environment variable OPENBLAS_DEFAULT_NUM_THREADS
 - added a GEMV-based implementation of GEMMT
 - added support for building under QNX
 - updated support for (cross-)building for ALPHA targets

x86_64:
 - added autodetection of Intel Raptor Lake cpu models
 - added SSCAL microkernels for Haswell and newer targets
 - improved the performance of the Haswell DSCAL microkernel
 - added CSCAL and ZSCAL microkernels for SkylakeX targets
 - fixed detection of gfortran and Cray CCE compilers
 - fixed detection of recent versions of the Intel Fortran compiler
 - fixed compilation with LLVM to no longer run out of AVX512 registers
 - fixed cpu type option setting with recent NVIDIA HPC compiler versions
 - fixed compilation for/on AMD Ryzen 4 cpus
 - fixed compilation of AVX2-capable targets with Apple Clang
 - fixed runtime selection of COOPERLAKE in DYNAMIC_ARCH builds
 - worked around gcc/llvm using risky FMA operations in CSCAL/ZSCAL
 - worked around miscompilations of GEMV, SYMV and ZDOT kernels
   by gcc12's tree-vectorizer on OSX and Windows

ARM:
 - fixed cross-compilation to ARMV5 and ARMV6 targets with CMAKE

ARMV8:
 - fixed cross-compilation to CortexA53 with CMAKE
 - fixed compilation with CMAKE and "Arm Compiler for Linux 22.1"
 - added cpu autodetection for Cortex X3 and A715
 - fixed conditional compilation of SVE-capable targets in DYNAMIC_ARCH
 - sped up SVE kernels by removing unnecessary prefetches
 - improved the GEMM performance of Neoverse V1
 - added SVE kernels for SDOT and DDOT
 - added an SBGEMM kernel for Neoverse N2
 - improved cpu-specific compiler option selection for Neoverse cpus
 - added support for setting CONSISTENT_FPCSR

MIPS64:
 - improved MSA capability detection and handling
 - added a MIPS64_GENERIC build target
 - fixed corner cases in DNRM2

LOONGARCH64:
 - fixed handling of the INTERFACE64 option

RISCV:
 - fixed handling of the INTERFACE64 option

====================================================================
Version 0.3.21
 07-Aug-2022

general:
 - Updated the included LAPACK to Reference-LAPACK release 3.10.1
 - when no Fortran compiler is available, OpenBLAS builds will now automatically
   build LAPACK from an f2c-converted copy of LAPACK 3.9.0 unless the NO_LAPACK option
   is specified
 - similarly added C versions of the BLAS and CBLAS tests
 - enabled building of the ReLAPACK GEMMT kernels when ReLAPACK is built
 - function LAPACKE_lsame is now annotated with the GCC attribute "const" to aid static analyzers
 - added USE_TLS to the list of options reported by the openblas_get_config() function
 - CMAKE builds now support the BUILD_TESTING keyword (to disable the LAPACK testsuite) of Reference-LAPACK
 - fixed CMAKE builds of the laswp_ncopy and neg_tcopy kernels
 - removed the build system requirements for PERL (while keeping the original perl scripts as backup)
 - handle building and running OpenBLAS on systems that report zero available cpu cores
 - added SYMBOLPREFIX/SYMBOLSUFFIX handling for LAPACK 3.10.0 functions added in 0.3.20
 - fixed linking of the utests on QNX
 - Added support for compilation with the Intel ifx compiler
 - Added support for compilation with the Fujitsu FCC compiler for Fugaku
 - Added support for compilation with the Cray C and Fortran compilers
 - reverted OpenMP threadpool behaviour in the exec_blas call to its state before 0.3.11: the
   threadpool no longer grows or shrinks on demand, as the overhead for this is too big, at least with
   GNU OpenMP. The adaptive behaviour introduced in 0.3.11 can still be requested at runtime by setting
   the environment variable OMP_ADAPTIVE
 - worked around spurious STFSM/CTFSM errors reported by the LAPACK testsuite

x86_64:
 - fixed determination of compiler support for AVX512 and removed the 0.3.19
   workaround for building SKYLAKEX kernels on Sandybridge hardware
 - fixed compilation for the SKYLAKEX target with gcc 6
 - fixed compilation of the CooperLake SBGEMM kernel with LLVM
 - fixed compilation of the SkyLakeX small matrix GEMM kernels with LLVM or ICC
 - fixed compilation of some BFLOAT16 kernels with CMAKE
 - added support for the Zhaoxin/Centaur KH40000 cpu
 - fixed a potential crash in the ZSYMV kernel used for all targets except generic
 - fixed gmake compilation for DYNAMIC_ARCH with a DYNAMIC_LIST including ATOM
 - fixed compilation of LAPACKE with the INTEGER64 option on Windows
 - added support for cross-compiling to individual Intel or AMD targets using CMAKE
   (previously only CORE2 was supported; added targets are ATOM, PRESCOTT, NEHALEM, SANDYBRIDGE,
   HASWELL, SKYLAKEX, COOPERLAKE, SAPPHIRERAPIDS, OPTERON, BARCELONA, BULLDOZER, PILEDRIVER,
   STEAMROLLER, EXCAVATOR, ZEN)

SPARC:
 - worked around an overflow error in the DNRM2 kernel

POWER:
 - worked around an overflow error in the POWER6 DNRM2 kernel
 - fixed compilation on PPC440
 - fixed a performance regression in the level1 BLAS on POWER10
 - fixed the POWER10 ZGEMM kernel
 - fixed singlethreaded builds for POWER10
 - fixed compilation of the POWER10 DGEMV kernel with older gcc versions
 - enabled compilation of the BFLOAT16 kernels by default
 - enabled the small matrix kernels by default for DYNAMIC_ARCH builds
 - added a workaround for a miscompilation of the CDOT and ZDOT kernels by GCC 12

RISCV:
 - fixed cpu autodetection logic

ARMV8:
 - added an SBGEMM kernel for Neoverse N2
 - worked around an overflow error in the DNRM2 kernel used on M1, NeoverseN1, ThunderX2T99
 - added support for ARM64 systems running MS Windows
 - added support for cross-compiling to the GENERIC ARMV8 target under CMAKE (Windows/MSVC)
 - fixed a performance regression in the generic ARMV8 DGEMM kernel introduced in 0.3.19
 - added initial support for the Apple M1 cpu under Linux
 - added initial support for the Phytium FT2000 cpu
 - added initial support for the Cortex A510, A710, X1 and X2 cpus
 - fixed an accidental mixup of cpu identifiers in the autodetection code introduced in 0.3.20
 - fixed linking of Apple M1 builds on macOS 12 and later with recent XCode
 - made Neoverse N2 available in DYNAMIC_ARCH builds

MIPS,MIPS64:
 - worked around an overflow error in the DNRM2 kernel

LOONGARCH64:
 - worked around an overflow error in the DNRM2 kernel
 - added preliminary support for the LOONGSON2K1000 cpu
 - added DYNAMIC_ARCH support

====================================================================
Version 0.3.20
 20-Feb-2022

general:
 - some code cleanup, with added casts etc.
 - fixed obtaining the cpu count with OpenMP and OMP_PROC_BIND unset
 - fixed pivot index calculation by ?LASWP for negative increments other than -1
 - fixed input argument check in LAPACK ?GEQRT2
 - improved the check for a Fortran compiler in CMAKE builds
 - disabled building OpenBLAS' optimized versions of LAPACK complex SPMV,SPR,SYMV,SYR with NO_LAPACK=1
 - fixed building of LAPACK on certain distributed filesystems with parallel gmake
 - fixed building the shared library on MacOS with classic flang

x86_64:
 - fixed cross-compilation with CMAKE for CORE2 target
 - fixed miscompilation of AVX512 code in DYNAMIC_ARCH builds
 - added support for the "incidental" AVX512 hardware in Alder Lake when enabled in BIOS

E2K:
 - added a new architecture (Russian Elbrus E2000 family)

SPARC:
 - fixed IMIN/IMAX

ARMV8:
 - added SVE-enabled CGEMM and ZGEMM kernels for ARMV8SVE and A64FX
 - added support for Neoverse N2 and V1 cpus

MIPS,MIPS64:
 - fixed autodetection of MSA capability

LOONGARCH64:
 - added an optimized DGEMM kernel

====================================================================
Version 0.3.19
 19-Dec-2021

general:
 - reverted unsafe TRSV/ZTRSV optimizations introduced in 0.3.16
 - fixed a potential thread race in the thread buffer reallocation routines
   that were introduced in 0.3.18
 - fixed miscounting of thread pool size on Linux with OMP_PROC_BIND=TRUE
 - fixed CBLAS interfaces for CSROT/ZSROT and CROTG/ZROTG
 - made automatic library suffix for CMAKE builds with INTERFACE64 available
   to CBLAS-only builds

x86_64:
 - DYNAMIC_ARCH builds now fall back to the cpu with most similar capabilities
   when an unknown CPUID is encountered, instead of defaulting to Prescott
 - added cpu detection for Intel Alder Lake
 - added cpu detection for Intel Sapphire Rapids
 - added an optimized SBGEMM kernel for Sapphire Rapids
 - fixed DYNAMIC_ARCH builds on OSX with CMAKE
 - worked around DYNAMIC_ARCH builds made on Sandybridge failing on SkylakeX
 - fixed missing thread initialization for static builds on Windows/MSVC
 - fixed an excessive read in ZSYMV

POWER:
 - added support for POWER10 in big-endian mode
 - added support for building with CMAKE
 - added optimized SGEMM and DGEMM kernels for small matrix sizes

ARMV8: