DRAFT Disk usage CI #11566 (Closed)
roigcarlo wants to merge 4 commits into master from ci/free-disk-action

Conversation

@roigcarlo
Member

No description provided.


jobs:
# Free disk space from the GitHub Ubuntu image, as it is very bloated (Android and DotNet are safe to remove and account for about ~11G)
# This task overlaps with other runs but should not cause a problem
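For context, a free-disk job of this kind usually boils down to deleting the preinstalled toolchains on the runner. A minimal sketch (the paths below are the usual locations on GitHub-hosted Ubuntu images, not taken from this PR):

```yaml
  free-disk:
    runs-on: ubuntu-latest
    steps:
      - name: Remove Android and DotNet toolchains
        run: |
          # Usual locations on the GitHub-hosted Ubuntu image (assumption)
          sudo rm -rf /usr/local/lib/android   # Android SDK/NDK
          sudo rm -rf /usr/share/dotnet        # .NET SDKs
          df -h /                              # report the space reclaimed
```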
Member

Instead, I would add `needs: free-disk` to ensure that this job runs before the others;
otherwise, if for whatever reason this one does not start, the others might randomly fail.

see https://docs.github.com/en/actions/using-workflows/about-workflows#creating-dependent-jobs
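The suggested dependency would look roughly like this (a sketch; the job names are taken from the discussion, the step bodies are placeholders):

```yaml
jobs:
  free-disk:
    runs-on: ubuntu-latest
    steps:
      - run: echo "free disk space here"
  ubuntu-fulldbg-gcc:
    needs: free-disk   # do not start until free-disk has finished
    runs-on: ubuntu-latest
    steps:
      - run: echo "build here"
```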

Member Author

Sadly, it is more complex...

It appears that every job runs on an independent worker (each tied to a different ubuntu-latest image), which means that changes made on one worker do not propagate to other jobs, dependent or not.

So basically, when I run this:
1 - Workflow starts
2 - free-disk and ubuntu-fulldbg-gcc run, each on a different worker
3 - free-disk ends and its worker ends with it, so the disk changes are lost

With the needs:
1 - Workflow starts
2 - free-disk starts with its assigned worker; ubuntu-fulldbg-gcc is queued
3 - free-disk ends and its worker ends with it. ubuntu-fulldbg-gcc starts on another worker.
4 - We end up in the same place.

Manually deleting those libs solves the problem, but there is a detail preventing it from working here. The rationale behind separating the free-disk step into a different job is that the compilation jobs run inside custom containers, so I don't have direct access to the worker.

Since the jobs do not share a worker, and the jobs that need more space run in a custom container, it is impossible to free space the normal way.

At this point I am trying several solutions:
1 - Trim the KratosUbuntu-22.04 image to include less software (difficult, as the main offender here is Intel's OneApi at 8G, 90% of the image size, and we need it)
2 - Divide the compilation into substeps and clear the intermediate compilation artifacts
3 - Manually mount the docker image and create a custom entry point that executes everything. Feasible, but a ton of work, and it will prevent us from splitting the job into different steps.
4 - Mount the host runner directories that we want to delete into the container, and delete their contents. (God forgive me for suggesting this)
5 - Pray for GitHub to implement pre-step and post-step. (Spoiler: not gonna happen)

I am trying to implement 2, but if that does not work out, we will have to rethink the whole process...
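For reference, option 4 could be sketched roughly like this, using the `container.volumes` key of GitHub Actions to expose host directories inside the container (untested; the image name and mounted paths are assumptions, not taken from this PR):

```yaml
jobs:
  ubuntu-fulldbg-gcc:
    runs-on: ubuntu-latest
    container:
      image: kratosmultiphysics/kratos-image-ci-ubuntu-22-04  # hypothetical image name
      volumes:
        # Mount the host's bloated toolchain directories into the container...
        - /usr/share/dotnet:/host/dotnet
        - /usr/local/lib/android:/host/android
    steps:
      - name: Free disk space on the host
        # ...so their contents can be deleted from inside the container.
        run: rm -rf /host/dotnet/* /host/android/*
```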

Member

Ah, so it is kind of a coincidence that the current solution allows running the free-disk job on the same worker?

Regarding your options, I have already optimized the image quite a bit regarding size.

Also, I tried to use custom entry points some time ago and it was not (yet?) supported.

So if it works, then kudos for figuring it out.

Member Author

No no, this solution does not allow for that; they are different workers (it is still failing in this PR).

Custom entry points are OK now, but then you can only do one job, or multiple, but then you have to handle the volumes manually to transfer data and bring up the container for each job, so I assume it will end up being slow... but yeah, it's possible now, and that's my backup solution if I cannot reorganize the compilation so we can delete artifacts while compiling.

Member

We could split the image into two: one base image with all the compilers, and one with the Intel stuff on top.
The Intel build only has the Core atm, which takes less space.

The only downside is that the gcc builds would not have Pardiso enabled.

Member

Also, there are multiple versions of mmg and parmmg installed.

@loumalouomega @marcnunezc can we remove the older versions to save space?

Contributor

Hi!

Currently, the unused versions (which can be safely removed) are:

# install MMG v5.6.0
git clone -b 'v5.6.0' --depth 1 https://github.com/MmgTools/mmg /tmp/mmg_5_6_0 && \
mkdir /tmp/mmg_5_6_0/build && \
mkdir -p /external_libraries/mmg/mmg_5_6_0 && \
cd /tmp/mmg_5_6_0/build && \
cmake .. -DCMAKE_RULE_MESSAGES=OFF -DCMAKE_C_FLAGS="-w" -DCMAKE_CXX_FLAGS="-w" -DCMAKE_INSTALL_PREFIX="/external_libraries/mmg/mmg_5_6_0" -DUSE_SCOTCH=OFF -DLIBMMG3D_SHARED=ON -DLIBMMG2D_SHARED=ON -DLIBMMGS_SHARED=ON -DLIBMMG_SHARED=ON && \
make -j2 install && \
cd / && \

# install ParMmg v1.4.0
git clone -b 'v1.4.0' --depth 1 https://github.com/MmgTools/ParMmg /tmp/ParMmg_1_4_0 && \
mkdir /tmp/ParMmg_1_4_0/build && \
mkdir -p /external_libraries/ParMmg_1_4_0 && \
cd /tmp/ParMmg_1_4_0/build && \
cmake .. -DCMAKE_RULE_MESSAGES=OFF -DCMAKE_C_FLAGS="-w" -DCMAKE_CXX_FLAGS="-w" -DCMAKE_INSTALL_PREFIX="/external_libraries/ParMmg_1_4_0" -DUSE_SCOTCH=OFF -DLIBPARMMG_SHARED=ON -DDOWNLOAD_MMG=OFF -DMMG_DIR="/tmp/mmg_5_6_0" -DMMG_BUILDDIR="/tmp/mmg_5_6_0/build" -DDOWNLOAD_METIS=OFF -DMETIS_DIR="/usr/include" && \
make -j2 install && \
rm -r /tmp/mmg_5_6_0 && \
rm -r /tmp/ParMmg_1_4_0 && \
cd / && \

These were part of an upgrade I did not finish in #10477, since some tests were giving problems. Also, a new version of MMG has been released, so it will be better to start over with the more recent releases.

Member

unused MMG versions removed in #11587

@matekelemen
Contributor

matekelemen commented Oct 21, 2023

Do we even need intel-oneapi-compiler-dpcpp-cpp-and-cpp-classic in the docker image?

We don't compile or test any applications with the intel compiler, and completely disregard Core test failures in the CI, so it seems to me that we basically don't support ICC. Take a look at this intel CI run as an example.

As a sidenote, the Kratos import prompt seems to be broken for ICC: it thinks it was compiled with clang:

Running KratosCore tests
/__w/Kratos/Kratos/bin/Custom/kratos/tests/test_KratosCore.py
 |  /           |
 ' /   __| _` | __|  _ \   __|
 . \  |   (   | |   (   |\__ \
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."1"--ef00170-Custom-x86_64
           Compiled for GNU/Linux and Python3.10 with Clang-17.0
Compiled with threading and MPI support.

@matekelemen
Contributor

matekelemen commented Oct 21, 2023

Disclaimer: this might not be too relevant to the problem at hand, but it's worth noting anyway.

I took a look at file sizes in our repo, and found that the worst offenders are:

20M     applications/FluidDynamicsApplication/tests/TwoFluidMassInletTest/mass_conservation_3d_results.json
18M     applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictional_mortar_contact_condition.cpp
12M     applications/ContactStructuralMechanicsApplication/custom_conditions/penalty_frictional_mortar_contact_condition.cpp
3,7M    applications/ThermalDEMApplication/ThermalDEMTheory.pdf
3,6M    applications/ConvectionDiffusionApplication/test_examples/square.gid/square.lin
3,2M    applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictionless_components_mortar_contact_condition.cpp
3,1M    applications/FluidDynamicsApplication/tests/AdjointVMSSensitivity2DTest/cylinder_slip_test_primal_results.json
3,0M    applications/SolidMechanicsApplication/tests/tests_data/dynamic_3d_beam.gif
3,0M    applications/FluidDynamicsApplication/tests/AdjointVMSSensitivity2DTest/cylinder_test_primal_results.json
2,9M    applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictionless_mortar_contact_condition.cpp

That's right: none of the 10 largest files in our repository is an .mdpa or other mesh. I think .pdf, .lin and .gif files have no business being in the source repo, and .jsons used as references for tests should be converted to binaries. However, the most worrying part is that there are source files in that list (all of them in ContactStructuralMechanicsApplication):

18M     applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictional_mortar_contact_condition.cpp
12M     applications/ContactStructuralMechanicsApplication/custom_conditions/penalty_frictional_mortar_contact_condition.cpp
2,9M    applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictionless_mortar_contact_condition.cpp

To put this into perspective, the entirety of StructuralMechanicsApplication is 13M (including meshes, references, everything). If you're wondering "what on earth could be in a source file so huge", the answer is probably automatically generated code from symbolic maths programs.

To take a look at build times, I compiled ContactStructuralMechanicsApplication alongside my standard apps, and sure enough, these sources take orders of magnitude longer to compile than an average translation unit:

build_analysis.txt

Putting it into perspective: these 3 source files take (a lot) longer to compile than the entirety of Core.
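A file-size survey like the one above can be reproduced with a small CI step (a sketch, not part of this PR; the same one-liner also works locally in any checkout):

```yaml
jobs:
  repo-size-report:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: List the 10 largest tracked files
        run: |
          # du over git-tracked files only, human-readable, largest first
          git ls-files -z | xargs -0 du -h | sort -rh | head -n 10
```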

@philbucher
Member

philbucher commented Oct 22, 2023

Do we even need intel-oneapi-compiler-dpcpp-cpp-and-cpp-classic in the docker image?

Yes, we need it; Intel is essential on clusters and other proprietary hardware.
The only reason it does not run tests / have more apps enabled is that I have not managed to put in the time.
Also, it is needed for Pardiso.

@philbucher
Copy link
Member

As a sidenote, the Kratos import prompt seems to be broken for ICC: it thinks it was compiled with clang:

Yes, we need to fix this; it's because the Intel compiler uses clang or gcc internally.

@loumalouomega
Member

However, the most worrying part is that there are source files in that list (all of them in ContactStructuralMechanicsApplication):

18M     applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictional_mortar_contact_condition.cpp
12M     applications/ContactStructuralMechanicsApplication/custom_conditions/penalty_frictional_mortar_contact_condition.cpp
2,9M    applications/ContactStructuralMechanicsApplication/custom_conditions/ALM_frictionless_mortar_contact_condition.cpp

Yes, the contact conditions are AD-generated. The main problem is the number of derivative terms and how they interact with each other.

@matekelemen
Contributor

yes we need it, Intel is essential on clusters

Alright, but what about the failing Core tests? Even though some tests fail, the intel CI job is always marked successful. Part of the CI run I linked earlier:

......s.s...........................x...ss.ssss.....................................................................ssss.s.........E.........................................ssss............................................................................................................ss...............................................................s.....................sssssss..ss...ss.............................x........s............x....................................................................................................................................FFFF..................
======================================================================
ERROR: test_assign_master_slave_constraints_to_neighbours_process (test_processes.TestProcesses)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/Kratos/Kratos/bin/Custom/kratos/tests/test_processes.py", line 2584, in test_assign_master_slave_constraints_to_neighbours_process
    self.assertVectorAlmostEqual(expected_weights, obtained_weights, delta=1e-6)  # Assert the vectors are equal within the specified tolerance
  File "/__w/Kratos/Kratos/bin/Custom/KratosMultiphysics/KratosUnittest.py", line 93, in assertVectorAlmostEqual
    self.assertAlmostEqual(v1, v2, places, LazyErrMsg(i, msg), delta)
  File "/usr/lib/python3.10/unittest/case.py", line 874, in assertAlmostEqual
    raise TypeError("specify delta or places not both")
TypeError: specify delta or places not both

======================================================================
FAIL: test_BinsDynamic_nodes (test_spatial_search.TestSpatialSearchSphere)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/Kratos/Kratos/bin/Custom/kratos/tests/test_spatial_search.py", line 126, in test_BinsDynamic_nodes
    self.assertTrue(distance in distance_ref)
AssertionError: False is not true
...

Just asking because I don't have much experience with targeting Intel clusters specifically: what's the performance gap between the Intel compiler and GCC/Clang? I've read that numerics used to be much faster on ICC (~10 years ago), but nowadays that's no longer the case.

Also it is needed for pardiso

Doesn't Pardiso have a C interface? I thought it didn't matter what I compile my app with if I'm just linking to it.

@roigcarlo
Member Author

To put this into perspective, the entirety of StructuralMechanicsApplication is 13M

The problem is not the source files. Trying to solve it through those is like trying to drain the sea with a spoon.

The main problem is the compilation artifacts. For example, taking Fluid + Structural with Debug (FullDebug builds are larger, but I don't have those numbers here):

1.6G	bin/Debug/libs
1.6G	bin/Debug/tests
4.4G	build/Debug/applications
3.6G	build/Debug/kratos

If you want a more detailed per-app view, things look like this (for the build folder, which corresponds to the 4.4G above):

880M	ContactStructuralMechanicsApplication
834M	StructuralMechanicsApplication
639M	ConstitutiveLawsApplication
447M	CompressiblePotentialFlowApplication
436M	LinearSolversApplication
1.3G	FluidDynamicsApplication

It is easy to see where the problem comes from if you scale this to ~30 apps. Assuming a conservative 500 MB per app, that is ~15 GB of compilation artifacts, plus another ~5 GB for shared libs, so a total of ~20 GB for the whole thing. (Remember the FullDebug compilation is heavier.)
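A per-app artifact report like the one above can be generated with a step along these lines (a sketch; the bin/ and build/ paths follow the layout quoted in this comment and may differ in practice):

```yaml
      - name: Report build artifact sizes
        run: |
          # Size of the compiled shared libs and test binaries
          du -sh bin/Debug/libs bin/Debug/tests
          # Per-application breakdown of intermediate artifacts, largest first
          du -sh build/Debug/applications/* | sort -rh
```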

I've read that numerics used to be much faster on ICC (~10 years ago) but nowadays that's no longer the case.

True, but I think the problem is more along the lines of not having an alternative to ICC on Intel clusters (if they are from Intel, my experience is that you get gcc 4.8.5, or 5 if you are lucky, icc, and nothing else).

@philbucher
Member

I have an idea for how to remove the Intel stuff from the docker image; will try after work.

@matekelemen
Contributor

The problem are not the source files. Trying to solve the problem with those is like trying to drain the sea with a spoon.

That's true, and I wouldn't dare propose that people use shorter class/function/variable names and write more concise code just to reduce size. That's the reason for my disclaimer at the beginning; I just wanted to share what I found while digging through our repo.

Main problem are the compilation artifacts.

I'll just point out that artifact sizes are directly influenced by the sizes of translation units, so chunky source files necessarily lead to fat object files and libraries. The rampant tendency to implement everything in headers probably doesn't help either.

Is there any chance of getting a privately hosted CI? I don't know much about the administration of Kratos, but it seems to me that anything we do to fix this issue will be a mere temporary solution if Kratos continues expanding as a project.

@philbucher
Member

philbucher commented Oct 23, 2023

GitHub supports privately (self-)hosted runners.

But then someone has to pay and maintain it...

We started setting this up only a few months before GitHub launched Actions, but then quickly abandoned it due to the overhead.

I think in the short/mid-term it makes sense to improve the CI (e.g. not always building everything, doing it in steps, etc.).
Maybe this will not be good enough in the long term, but since we could "easily" add self-hosted runners (which would profit from the same improvements), I think it is worth investing the effort.

@roigcarlo
Member Author

Github offers to use privately hosted runners

Also, security is a nightmare, since you are basically opening the doors of whatever machine hosts the runner.

@philbucher
Member

Technically we can also pay GitHub to use their larger runners (faster, more memory, etc.), but I don't think they are cheap.
Never done the math, though.

@roigcarlo roigcarlo closed this Jan 19, 2024
@roigcarlo roigcarlo deleted the ci/free-disk-action branch January 19, 2024 16:09