
Add volumes in 'Expunging' state to storage cleanup thread and during delete storage pool#12602

Open
sureshanaparti wants to merge 2 commits into apache:4.20 from shapeblue:cleanup-volumes-in-expunging-state

Conversation

@sureshanaparti
Contributor

@sureshanaparti sureshanaparti commented Feb 6, 2026

Description

This PR includes volumes in the 'Expunging' state in the storage cleanup thread and in the cleanup performed when a storage pool is deleted.

In some scenarios a volume gets stuck in the Expunging state: the expunge flow fails and no failure callback is received (for example, because of a Management server crash or restart, or a network loss), so the volume never reverts to the Destroy state. Expunge is then never attempted again on such volumes, and they are left behind on the storage.
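The selection rule described above can be illustrated with a minimal, self-contained Java sketch. It is not the code from this PR: `ExpungingCleanupSketch`, `Vol`, `State` and `selectForCleanup` are hypothetical stand-ins, and the assumption is simply that the cleanup pass picks up every non-removed volume in either the Destroy or the Expunging state.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-ins for illustration; these are not CloudStack's real classes.
class ExpungingCleanupSketch {
    enum State { Ready, Destroy, Expunging, Expunged }

    static final class Vol {
        final long id;
        final State state;
        final boolean removed;

        Vol(long id, State state, boolean removed) {
            this.id = id;
            this.state = state;
            this.removed = removed;
        }
    }

    // Assumption: the cleanup pass considers both Destroy and Expunging volumes.
    static final Set<State> CLEANUP_STATES = EnumSet.of(State.Destroy, State.Expunging);

    // Returns the volumes that a cleanup pass would attempt to expunge.
    static List<Vol> selectForCleanup(List<Vol> volumes) {
        return volumes.stream()
                .filter(v -> !v.removed && CLEANUP_STATES.contains(v.state))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Vol> volumes = List.of(
                new Vol(6, State.Destroy, false),
                new Vol(10, State.Expunging, false),
                new Vol(11, State.Ready, false),
                new Vol(12, State.Expunged, true));
        for (Vol v : selectForCleanup(volumes)) {
            System.out.println(v.id + " " + v.state);
        }
    }
}
```

Keeping the eligible states in one set means a stuck Expunging volume is treated exactly like a Destroy volume by the selection step, which is the behaviour the description asks for.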

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@sureshanaparti sureshanaparti changed the title Add volumes in 'Expunging' state in storage cleanup thread and during delete storage pool Add volumes in 'Expunging' state to storage cleanup thread and during delete storage pool Feb 6, 2026
@sureshanaparti sureshanaparti force-pushed the cleanup-volumes-in-expunging-state branch from dd23e40 to ca2552d Compare February 6, 2026 06:48
@sureshanaparti sureshanaparti added this to the 4.20.3 milestone Feb 6, 2026
@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov

codecov bot commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 16.25%. Comparing base (3d7d412) to head (1d8e142).
⚠️ Report is 9 commits behind head on 4.20.

Files with missing lines | Patch % | Lines
...main/java/com/cloud/storage/dao/VolumeDaoImpl.java | 20.00% | 4 Missing ⚠️
...e/cloudstack/storage/volume/VolumeServiceImpl.java | 0.00% | 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.20   #12602      +/-   ##
============================================
- Coverage     16.26%   16.25%   -0.02%     
+ Complexity    13429    13420       -9     
============================================
  Files          5661     5662       +1     
  Lines        500010   500162     +152     
  Branches      60715    60732      +17     
============================================
- Hits          81331    81300      -31     
- Misses       409606   409778     +172     
- Partials       9073     9084      +11     
Flag | Coverage Δ
uitests | 4.15% <ø> (ø)
unittests | 17.10% <14.28%> (-0.02%) ⬇️


@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 16722

@sureshanaparti
Contributor Author

@blueorangutan test

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

Contributor

@DaanHoogland DaanHoogland left a comment

clgtm

@blueorangutan

[SF] Trillian test result (tid-15415)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 53230 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr12602-t15415-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 16868

Collaborator

@abh1sar abh1sar left a comment

Code LGTM

Member

@kiranchavala kiranchavala left a comment

LGTM

Tested manually

  1. Set the global settings "storage.cleanup.delay" and "storage.cleanup.interval" to 200.

  2. Create a volume in CloudStack.

  3. Destroy the volume in CloudStack and make sure it is in the Destroy state.

Example

mysql> select id,name,state,removed from volumes where name="test2" ;
+----+-------+---------+---------+
| id | name  | state   | removed |
+----+-------+---------+---------+
|  6 | test2 | Destroy | NULL    |
+----+-------+---------+---------+
1 row in set (0.00 sec)

  4. Set the volume state to Expunging directly in the database.

mysql> UPDATE volumes SET state = 'Expunging' WHERE id = 6;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

  5. Stop the CloudStack service.

  6. Start the CloudStack service.

  7. Check the logs.

  8. CloudStack cleaned up the volume that was in the Expunging state.

mysql> select id,name,state,removed from volumes where id=10;
+----+--------+-----------+---------+
| id | name   | state     | removed |
+----+--------+-----------+---------+
| 10 | test   | Expunging | NULL    |
+----+--------+-----------+---------+
1 row in set (0.00 sec)

mysql> select id,name,state,removed from volumes where id=10;
+----+--------+----------+---------------------+
| id | name   | state    | removed             |
+----+--------+----------+---------------------+
| 10 | test   | Expunged | 2026-02-18 10:26:44 |
+----+--------+----------+---------------------+

Logs

[root@ref-trl-11081-k-Mol8-kiran-chavala-mgmt1 ~]# cat /var/log/cloudstack/management/management-server.log |grep -i "logid:4c2827f6"
2026-02-18 10:26:43,885 DEBUG [c.c.s.StorageManagerImpl] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Storage pool garbage collector found [0] Templates to be cleaned up in storage pool [StoragePool {"id":1,"name":"ref-trl-11081-k-Mol8-kiran-chavala-kvm-pri1","poolType":"NetworkFilesystem","uuid":"48025699-3b4f-3165-ab47-5707e5511771"}].
2026-02-18 10:26:43,887 DEBUG [c.c.s.StorageManagerImpl] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Storage pool garbage collector found [0] Templates to be cleaned up in storage pool [StoragePool {"id":2,"name":"ref-trl-11081-k-Mol8-kiran-chavala-kvm-pri2","poolType":"NetworkFilesystem","uuid":"b730a16f-f034-3a44-b659-9a169aabd970"}].
2026-02-18 10:26:43,889 DEBUG [c.c.s.StatsCollector] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Verifying image storage [ImageStore {"id":1,"name":"NFS:\/\/10.0.32.4\/acs\/secondary\/ref-trl-11081-k-Mol8-kiran-chavala\/ref-trl-11081-k-Mol8-kiran-chavala-sec1","uuid":"7e5d0bb2-f8ea-497e-bfff-f9f9fd3da59f"}]. Capacity: total=[2.6357 TB], used=[1.5451 TB], threshold=[95.00%].
2026-02-18 10:26:43,890 DEBUG [c.c.s.StorageManagerImpl] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Secondary storage garbage collector found 0 Templates to cleanup on template_store_ref for store: ImageStore {"id":1,"name":"NFS:\/\/10.0.32.4\/acs\/secondary\/ref-trl-11081-k-Mol8-kiran-chavala\/ref-trl-11081-k-Mol8-kiran-chavala-sec1","uuid":"7e5d0bb2-f8ea-497e-bfff-f9f9fd3da59f"}
2026-02-18 10:26:43,890 DEBUG [c.c.s.StorageManagerImpl] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Secondary storage garbage collector found 0 snapshots to cleanup on snapshot_store_ref for store: ImageStore {"id":1,"name":"NFS:\/\/10.0.32.4\/acs\/secondary\/ref-trl-11081-k-Mol8-kiran-chavala\/ref-trl-11081-k-Mol8-kiran-chavala-sec1","uuid":"7e5d0bb2-f8ea-497e-bfff-f9f9fd3da59f"}
2026-02-18 10:26:43,892 DEBUG [c.c.s.StorageManagerImpl] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Secondary storage garbage collector found 0 volumes to cleanup on volume_store_ref for store: ImageStore {"id":1,"name":"NFS:\/\/10.0.32.4\/acs\/secondary\/ref-trl-11081-k-Mol8-kiran-chavala\/ref-trl-11081-k-Mol8-kiran-chavala-sec1","uuid":"7e5d0bb2-f8ea-497e-bfff-f9f9fd3da59f"}
2026-02-18 10:26:43,900 INFO  [o.a.c.s.v.VolumeServiceImpl] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Volume VolumeObject {"dataStore":"StoragePool {\"id\":1,\"name\":\"ref-trl-11081-k-Mol8-kiran-chavala-kvm-pri1\",\"poolType\":\"NetworkFilesystem\",\"uuid\":\"48025699-3b4f-3165-ab47-5707e5511771\"}","volumeVO":"Volume {\"id\":10,\"instanceId\":null,\"name\":\"testgd\",\"uuid\":\"450d3d1c-1232-4f82-b5ad-2e7a1a2ec400\",\"volumeType\":\"DATADISK\"}"} is already in Expunging, retrying
2026-02-18 10:26:43,939 DEBUG [c.c.h.o.r.Ovm3HypervisorGuru] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) getCommandHostDelegation: class org.apache.cloudstack.storage.command.DeleteCommand
2026-02-18 10:26:43,941 DEBUG [c.c.h.XenServerGuru] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) We are returning the default host to execute commands because the command is not of Copy type.
2026-02-18 10:26:43,941 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Wait time setting on org.apache.cloudstack.storage.command.DeleteCommand is 1800 seconds
2026-02-18 10:26:43,942 DEBUG [c.c.a.m.ClusteredAgentAttache] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Seq 2-1107885508333142034: Routed from 32987932525600
2026-02-18 10:26:43,943 DEBUG [c.c.a.t.Request] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Seq 1-1107885508333142034: Sending  { Cmd , MgmtId: 32987932525600, via: 1(ref-trl-11081-k-Mol8-kiran-chavala-kvm1), Ver: v1, Flags: 100011, [{"org.apache.cloudstack.storage.command.DeleteCommand":{"data":{"org.apache.cloudstack.storage.to.VolumeObjectTO":{"uuid":"450d3d1c-1232-4f82-b5ad-2e7a1a2ec400","volumeType":"DATADISK","dataStore":{"org.apache.cloudstack.storage.to.PrimaryDataStoreTO":{"uuid":"48025699-3b4f-3165-ab47-5707e5511771","name":"ref-trl-11081-k-Mol8-kiran-chavala-kvm-pri1","id":"1","poolType":"NetworkFilesystem","host":"10.0.32.4","path":"/acs/primary/ref-trl-11081-k-Mol8-kiran-chavala/ref-trl-11081-k-Mol8-kiran-chavala-kvm-pri1","port":"2049","url":"NetworkFilesystem://10.0.32.4/acs/primary/ref-trl-11081-k-Mol8-kiran-chavala/ref-trl-11081-k-Mol8-kiran-chavala-kvm-pri1/?ROLE=Primary&STOREUUID=48025699-3b4f-3165-ab47-5707e5511771","isManaged":"false"}},"name":"testgd","size":"(100.00 GB) 107374182400","path":"450d3d1c-1232-4f82-b5ad-2e7a1a2ec400","volumeId":"10","accountId":"2","format":"QCOW2","provisioningType":"THIN","poolId":"1","id":"10","hypervisorType":"KVM","directDownload":"false","deployAsIs":"false","followRedirects":"false"}},"wait":"0","bypassHostMaintenance":"false"}}] }
2026-02-18 10:26:44,075 DEBUG [c.c.a.t.Request] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Seq 1-1107885508333142034: Received:  { Ans: , MgmtId: 32987932525600, via: 1(ref-trl-11081-k-Mol8-kiran-chavala-kvm1), Ver: v1, Flags: 10, { Answer } }
2026-02-18 10:26:44,083 INFO  [o.a.c.s.v.VolumeServiceImpl] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Volume VolumeObject {"dataStore":"StoragePool {\"id\":1,\"name\":\"ref-trl-11081-k-Mol8-kiran-chavala-kvm-pri1\",\"poolType\":\"NetworkFilesystem\",\"uuid\":\"48025699-3b4f-3165-ab47-5707e5511771\"}","volumeVO":"Volume {\"id\":10,\"instanceId\":null,\"name\":\"testgd\",\"uuid\":\"450d3d1c-1232-4f82-b5ad-2e7a1a2ec400\",\"volumeType\":\"DATADISK\"}"} is not referred anywhere, remove it from volumes table
2026-02-18 10:26:44,083 DEBUG [c.c.s.d.VolumeDaoImpl] (StorageManager-Scavenger-1:[ctx-212727e0]) (logid:4c2827f6) Removing volume 10 from DB
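
The retry visible in the log above ("is already in Expunging, retrying") can be sketched as a small state-handling routine. This is a hedged illustration rather than the actual `VolumeServiceImpl` code: `ExpungeRetrySketch`, `Vol`, `State` and the `deleteOnStorage` callback are hypothetical names, and the storage delete is simulated by a predicate.

```java
import java.util.function.LongPredicate;

// Hypothetical stand-ins for illustration; these are not CloudStack's real classes.
class ExpungeRetrySketch {
    enum State { Ready, Destroy, Expunging, Expunged }

    static final class Vol {
        final long id;
        State state;

        Vol(long id, State state) {
            this.id = id;
            this.state = state;
        }
    }

    // Returns true once the volume has reached the Expunged state.
    // deleteOnStorage is an assumed callback that reports whether the
    // storage-side delete succeeded.
    static boolean expunge(Vol vol, LongPredicate deleteOnStorage) {
        switch (vol.state) {
            case Expunged:
                return true; // nothing left to do
            case Expunging:
                // An earlier attempt never completed; retry instead of skipping.
                System.out.println("Volume " + vol.id + " is already in Expunging, retrying");
                break;
            case Destroy:
                vol.state = State.Expunging; // normal first attempt
                break;
            default:
                return false; // volume is not marked for deletion
        }
        if (deleteOnStorage.test(vol.id)) {
            vol.state = State.Expunged;
            return true;
        }
        return false; // stays in Expunging so the next cleanup pass retries
    }

    public static void main(String[] args) {
        Vol stuck = new Vol(10, State.Expunging);
        boolean first = expunge(stuck, id -> false);  // simulated storage failure
        boolean second = expunge(stuck, id -> true);  // simulated success
        System.out.println(first + " " + second + " " + stuck.state);
    }
}
```

In this sketch a failed delete leaves the volume in Expunging, so the next cleanup pass picks it up again instead of abandoning it.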
