DRAFT Disk usage CI #11566 (Closed)
roigcarlo wants to merge 4 commits into master from ci/free-disk-action

Conversation

@roigcarlo
Member

No description provided.


jobs:
# Free disk space from the GitHub Ubuntu image, as it is very bloated (Android and DotNet are safe to remove and account for about ~11G)
# This task overlaps with other runs but should not cause a problem
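For context, a free-disk job of this kind usually boils down to deleting the preinstalled toolchains on the runner. A minimal sketch (the paths below are the usual locations on GitHub-hosted Ubuntu images, not taken from this PR):

```yaml
  free-disk:
    runs-on: ubuntu-latest
    steps:
      - name: Remove Android and DotNet toolchains
        run: |
          # Usual locations on the GitHub-hosted Ubuntu image (assumption)
          sudo rm -rf /usr/local/lib/android   # Android SDK/NDK
          sudo rm -rf /usr/share/dotnet        # .NET SDKs
          df -h /                              # report the space reclaimed
```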
Member

Instead, I would add `needs: free-disk` to ensure that this job runs before the others;
otherwise, if for whatever reason this one does not start, the others might randomly fail.

see https://docs.github.com/en/actions/using-workflows/about-workflows#creating-dependent-jobs
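The suggested dependency would look roughly like this (a sketch; the job names are taken from the discussion, the step bodies are placeholders):

```yaml
jobs:
  free-disk:
    runs-on: ubuntu-latest
    steps:
      - run: echo "free disk space here"
  ubuntu-fulldbg-gcc:
    needs: free-disk   # do not start until free-disk has finished
    runs-on: ubuntu-latest
    steps:
      - run: echo "build here"
```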

Member Author

Sadly, it is more complex...

It appears that every job runs on an independent worker (each tied to a different ubuntu-latest image), which means that changes made on one worker do not propagate to other jobs, dependent or not.

So basically, when I run this:
1 - Workflow starts
2 - free-disk and ubuntu-fulldbg-gcc run, each on a different worker
3 - free-disk ends and its worker ends with it, so the disk changes are lost

With the needs:
1 - Workflow starts
2 - free-disk starts with its assigned worker; ubuntu-fulldbg-gcc is queued
3 - free-disk ends and its worker ends with it. ubuntu-fulldbg-gcc starts on another worker.
4 - We end up in the same place.

Manually deleting those libs solves the problem, but there is a detail preventing it from working here. The rationale behind separating the free-disk step into a different job is that the compilation jobs run inside custom containers, so I don't have direct access to the worker.

Since the jobs do not share a worker, and the jobs that need more space run in a custom container, it is impossible to free space the normal way.

At this point I am trying several solutions:
1 - Trim the KratosUbuntu-22.04 image to include less software (difficult, as the main offender here is Intel's OneApi at 8G, 90% of the image size, and we need it)
2 - Divide the compilation into substeps and clear the intermediate compilation artifacts
3 - Manually mount the docker image and create a custom entry point that executes everything. Feasible, but a ton of work, and it will prevent us from splitting the job into different steps.
4 - Mount the host runner directories that we want to delete into the container, and delete their contents. (God forgive me for suggesting this)
5 - Pray for GitHub to implement pre-step and post-step. (Spoiler: not gonna happen)

I am trying to implement 2, but if that does not work out, we will have to rethink the whole process...
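For reference, option 4 could be sketched roughly like this, using the `container.volumes` key of GitHub Actions to expose host directories inside the container (untested; the image name and mounted paths are assumptions, not taken from this PR):

```yaml
jobs:
  ubuntu-fulldbg-gcc:
    runs-on: ubuntu-latest
    container:
      image: kratosmultiphysics/kratos-image-ci-ubuntu-22-04  # hypothetical image name
      volumes:
        # Mount the host's bloated toolchain directories into the container...
        - /usr/share/dotnet:/host/dotnet
        - /usr/local/lib/android:/host/android
    steps:
      - name: Free disk space on the host
        # ...so their contents can be deleted from inside the container.
        run: rm -rf /host/dotnet/* /host/android/*
```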

Member

Ah, so it is kind of a coincidence that the current solution allows running the free-disk job on the same worker?

Regarding your options, I have already optimized the image quite a bit regarding size.

Also, I tried to use custom entry points some time ago and it was not (yet?) supported.

So if it works, then kudos for figuring it out.

Member Author

No no, this solution does not allow for that; they are different workers (it is still failing in this PR).

Custom entry points are OK now, but then you can only do one job, or multiple, but then you have to handle the volumes manually to transfer data and bring up the container for each job, so I assume it will end up being slow... but yeah, it's possible now, and that's my backup solution if I cannot reorganize the compilation so we can delete artifacts while compiling.

Member

We could split the image into two: one base image with all the compilers, and one with the Intel stuff on top.
The Intel build only has the Core atm, which takes less space.

The only downside is that the gcc builds would not have Pardiso enabled.

Member

Also, there are multiple versions of mmg and parmmg installed.

@loumalouomega @marcnunezc can we remove the older versions to save space?

Contributor

Hi!

Currently, the unused versions (which can be safely removed) are:

# install MMG v5.6.0
git clone -b 'v5.6.0' --depth 1 https://github.com/MmgTools/mmg /tmp/mmg_5_6_0 && \
mkdir /tmp/mmg_5_6_0/build && \
mkdir -p /external_libraries/mmg/mmg_5_6_0 && \
cd /tmp/mmg_5_6_0/build && \
cmake .. -DCMAKE_RULE_MESSAGES=OFF -DCMAKE_C_FLAGS="-w" -DCMAKE_CXX_FLAGS="-w" -DCMAKE_INSTALL_PREFIX="/external_libraries/mmg/mmg_5_6_0" -DUSE_SCOTCH=OFF -DLIBMMG3D_SHARED=ON -DLIBMMG2D_SHARED=ON -DLIBMMGS_SHARED=ON -DLIBMMG_SHARED=ON && \
make -j2 install && \
cd / && \

# install ParMmg v1.4.0
git clone -b 'v1.4.0' --depth 1 https://github.com/MmgTools/ParMmg /tmp/ParMmg_1_4_0 && \
mkdir /tmp/ParMmg_1_4_0/build && \
mkdir -p /external_libraries/ParMmg_1_4_0 && \
cd /tmp/ParMmg_1_4_0/build && \
cmake .. -DCMAKE_RULE_MESSAGES=OFF -DCMAKE_C_FLAGS="-w" -DCMAKE_CXX_FLAGS="-w" -DCMAKE_INSTALL_PREFIX="/external_libraries/ParMmg_1_4_0" -DUSE_SCOTCH=OFF -DLIBPARMMG_SHARED=ON -DDOWNLOAD_MMG=OFF -DMMG_DIR="/tmp/mmg_5_6_0" -DMMG_BUILDDIR="/tmp/mmg_5_6_0/build" -DDOWNLOAD_METIS=OFF -DMETIS_DIR="/usr/include" && \
make -j2 install && \
rm -r /tmp/mmg_5_6_0 && \
rm -r /tmp/ParMmg_1_4_0 && \
cd / && \

These were part of an upgrade I did not finish in #10477, since some tests were giving problems. Also, a new version of MMG has been released, so it will be better to start over with the more recent releases.

Member

unused MMG versions removed in #11587

@matekelemen
Contributor

matekelemen commented Oct 21, 2023

Do we even need intel-oneapi-compiler-dpcpp-cpp-and-cpp-classic in the docker image?

We don't compile or test any applications with the intel compiler, and completely disregard Core test failures in the CI, so it seems to me that we basically don't support ICC. Take a look at this intel CI run as an example.

As a sidenote, the Kratos import prompt seems to be broken for ICC: it thinks it was compiled with clang:

Running KratosCore tests
/__w/Kratos/Kratos/bin/Custom/kratos/tests/test_KratosCore.py
 |  /           |
 ' /   __| _` | __|  _ \   __|
 . \  |   (   | |   (   |\__ \
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."1"--ef00170-Custom-x86_64
           Compiled for GNU/Linux and Python3.10 with Clang-17.0
Compiled with threading and MPI support.

@matekelemen
Contributor

matekelemen commented Oct 21, 2023

Disclaimer: this might not be too relevant to the problem at hand, but it's worth noting anyway.

I took a look at file sizes in our repo, and found that the worst offenders are:

20M     applications/FluidDynamicsApplication/tests/TwoFluidMassInletTest/mass_conservation_3d_results.json
18M     applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictional_mortar_contact_condition.cpp
12M     applications/ContactStructuralMechanicsApplication/custom_conditions/penalty_frictional_mortar_contact_condition.cpp
3,7M    applications/ThermalDEMApplication/ThermalDEMTheory.pdf
3,6M    applications/ConvectionDiffusionApplication/test_examples/square.gid/square.lin
3,2M    applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictionless_components_mortar_contact_condition.cpp
3,1M    applications/FluidDynamicsApplication/tests/AdjointVMSSensitivity2DTest/cylinder_slip_test_primal_results.json
3,0M    applications/SolidMechanicsApplication/tests/tests_data/dynamic_3d_beam.gif
3,0M    applications/FluidDynamicsApplication/tests/AdjointVMSSensitivity2DTest/cylinder_test_primal_results.json
2,9M    applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictionless_mortar_contact_condition.cpp

That's right: none of the 10 largest files in our repository is an .mdpa or other mesh. I think .pdf, .lin and .gif files have no business being in the source repo, and .jsons used as references for tests should be converted to binaries. However, the most worrying part is that there are source files in that list (all of them in ContactStructuralMechanicsApplication):

18M     applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictional_mortar_contact_condition.cpp
12M     applications/ContactStructuralMechanicsApplication/custom_conditions/penalty_frictional_mortar_contact_condition.cpp
2,9M    applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictionless_mortar_contact_condition.cpp

To put this into perspective, the entirety of StructuralMechanicsApplication is 13M (including meshes, references, everything). If you're wondering "what on earth could be in a source file so huge", the answer is probably automatically generated code from symbolic maths programs.

To take a look at build times, I compiled ContactStructuralMechanicsApplication alongside my standard apps, and sure enough, these sources take orders of magnitude longer to compile than an average translation unit:

build_analysis.txt

Putting it into perspective: these 3 source files take (a lot) longer to compile than the entirety of Core.
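A file-size survey like the one above can be reproduced with a small CI step (a sketch, not part of this PR; the same one-liner also works locally in any checkout):

```yaml
jobs:
  repo-size-report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: List the 10 largest tracked files
        run: |
          # du over git-tracked files only, human-readable, largest first
          git ls-files -z | xargs -0 du -h | sort -rh | head -n 10
```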

@philbucher
Member

philbucher commented Oct 22, 2023

Do we even need intel-oneapi-compiler-dpcpp-cpp-and-cpp-classic in the docker image?

Yes, we need it; Intel is essential on clusters and other proprietary hardware.
The only reason it does not run tests / have more apps enabled is that I have not managed to put in the time.
Also, it is needed for Pardiso.

@philbucher
Copy link
Member

As a sidenote, the Kratos import prompt seems to be broken for ICC: it thinks it was compiled with clang:

Yes, we need to fix this; it's because the Intel compiler uses clang or gcc internally.

@loumalouomega
Member

However, the most worrying part is that there are source files in that list (all of them in ContactStructuralMechanicsApplication):

18M     applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictional_mortar_contact_condition.cpp
12M     applications/ContactStructuralMechanicsApplication/custom_conditions/penalty_frictional_mortar_contact_condition.cpp
2,9M    applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictionless_mortar_contact_condition.cpp

Yes, the contact conditions are AD-generated. The main problem is the number of derivative terms and how they interact with each other.

@matekelemen
Contributor

yes we need it, Intel is essential on clusters

Alright, but what about the failing Core tests? Even though some tests fail, the intel CI job is always marked successful. Part of the CI run I linked earlier:

......s.s...........................x...ss.ssss.....................................................................ssss.s.........E.........................................ssss............................................................................................................ss...............................................................s.....................sssssss..ss...ss.............................x........s............x....................................................................................................................................FFFF..................
======================================================================
ERROR: test_assign_master_slave_constraints_to_neighbours_process (test_processes.TestProcesses)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/Kratos/Kratos/bin/Custom/kratos/tests/test_processes.py", line 2584, in test_assign_master_slave_constraints_to_neighbours_process
    self.assertVectorAlmostEqual(expected_weights, obtained_weights, delta=1e-6)  # Assert the vectors are equal within the specified tolerance
  File "/__w/Kratos/Kratos/bin/Custom/KratosMultiphysics/KratosUnittest.py", line 93, in assertVectorAlmostEqual
    self.assertAlmostEqual(v1, v2, places, LazyErrMsg(i, msg), delta)
  File "/usr/lib/python3.10/unittest/case.py", line 874, in assertAlmostEqual
    raise TypeError("specify delta or places not both")
TypeError: specify delta or places not both

======================================================================
FAIL: test_BinsDynamic_nodes (test_spatial_search.TestSpatialSearchSphere)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/Kratos/Kratos/bin/Custom/kratos/tests/test_spatial_search.py", line 126, in test_BinsDynamic_nodes
    self.assertTrue(distance in distance_ref)
AssertionError: False is not true
...

Just asking because I don't have much experience with targeting Intel clusters specifically: what's the performance gap between the Intel compiler and GCC/Clang? I've read that numerics used to be much faster on ICC (~10 years ago), but nowadays that's no longer the case.

Also it is needed for pardiso

Doesn't Pardiso have a C interface? I thought it didn't matter what I compile my app with if I'm just linking to it.

@roigcarlo
Member Author

To put this into perspective, the entirety of StructuralMechanicsApplication is 13M

The problem is not the source files. Trying to solve it through those is like trying to drain the sea with a spoon.

The main problem is the compilation artifacts. For example, taking Fluid + Structural with Debug (FullDebug builds are larger, but I don't have those numbers here):

1.6G	bin/Debug/libs
1.6G	bin/Debug/tests
4.4G	build/Debug/applications
3.6G	build/Debug/kratos

If you want a more detailed per-app view, things look like this (for the build folder, which corresponds to the 4.4G above):

880M	ContactStructuralMechanicsApplication
834M	StructuralMechanicsApplication
639M	ConstitutiveLawsApplication
447M	CompressiblePotentialFlowApplication
436M	LinearSolversApplication
1.3G	FluidDynamicsApplication

It is easy to see where the problem comes from if you scale this to ~30 apps. Assuming a conservative 500 MB per app, that is ~15 GB of compilation artifacts, plus another ~5 GB for shared libs, so a total of ~20 GB for the whole thing. (Remember the FullDebug compilation is heavier.)
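A per-app artifact report like the one above can be generated with a step along these lines (a sketch; the bin/ and build/ paths follow the layout quoted in this comment and may differ in practice):

```yaml
      - name: Report build artifact sizes
        run: |
          # Size of the compiled shared libs and test binaries
          du -sh bin/Debug/libs bin/Debug/tests
          # Per-application breakdown of intermediate artifacts, largest first
          du -sh build/Debug/applications/* | sort -rh
```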

I've read that numerics used to be much faster on ICC (~10 years ago) but nowadays that's no longer the case.

True, but I think the problem is more along the lines of not having an alternative to ICC on Intel clusters (if they are from Intel, my experience is that you get gcc 4.8.5, or 5 if you are lucky, icc, and nothing else).

@philbucher
Member

I have an idea for how to remove the Intel stuff from the docker image; will try after work.

@matekelemen
Contributor

The problem are not the source files. Trying to solve the problem with those is like trying to drain the sea with a spoon.

That's true, and I wouldn't dare propose that people use shorter class/function/variable names and write more concise code just to reduce size. That's the reason for my disclaimer at the beginning; I just wanted to share what I found while digging through our repo.

Main problem are the compilation artifacts.

I'll just point out that artifact sizes are directly influenced by the sizes of translation units, so chunky source files necessarily lead to fat object files and libraries. The rampant tendency to implement everything in headers probably doesn't help either.

Is there any chance of getting a privately hosted CI? I don't know much about the administration of Kratos, but it seems to me that anything we do to fix this issue will be a mere temporary solution if Kratos continues expanding as a project.

@philbucher
Member

philbucher commented Oct 23, 2023

GitHub supports privately (self-)hosted runners.

But then someone has to pay and maintain it...

We started setting this up only a few months before GitHub launched Actions, but then quickly abandoned it due to the overhead.

I think in the short/mid-term it makes sense to improve the CI (e.g. not always building everything, doing it in steps, etc.).
Maybe this will not be good enough in the long term, but since we could "easily" add self-hosted runners (which would profit from the same improvements), I think it is worth investing the effort.

@roigcarlo
Member Author

Github offers to use privately hosted runners

Also, security is a nightmare, since you are basically opening the doors of whatever machine hosts the runner.

@philbucher
Member

Technically we can also pay GitHub to use their larger runners (faster, more memory, etc.), but I don't think they are cheap.
Never done the math, though.

@roigcarlo roigcarlo closed this Jan 19, 2024
@roigcarlo roigcarlo deleted the ci/free-disk-action branch January 19, 2024 16:09