5 May 2023

Migration of 50,000 secrets in production

Context
Project viability study
Migration preparation
Vault token management
Migration
Feedback

Context

As explained in the article Dynamic secret management in Puppet, trocla was quickly set up by co-workers in the Claranet Puppet infrastructure for secrets management. Over time there have been several developments, particularly in the choice of backend. Initially tested in memory, it went through several stages (file, MySQL, PostgreSQL) before settling on a stable version, which lasted several years, on a simple MySQL in primary/replica replication. Although the database was very light, it contained more than 50k secrets.

In 2019, after many discussions, a co-worker wanted to test Vault, and potentially migrate the Puppet secret management there. We already had several goals which could only be validated by this migration:

End of support for an aging OS/MySQL infrastructure
A built-in native API, which is the very principle of how Vault works
A security gain from the encryption and access management handled by Vault (encryption was until then handled by an x509 certificate)
Technological convergence with the French group
Metadata management (particularly history dates and versioning)
Support for ACLs and integration with AD
An integrated web UI: until then, secrets management by non-ops was done with a custom web UI which stored the secrets through a dedicated Trocla API
Improved resiliency, with a Consul backend for Vault

Project viability study

A year went by before we really got back to the subject, and we started by listing the main difficulties that this migration represented.

The 4-5k Puppet agents run in production every 30 min, to retrieve or create thousands of system, FTP and database account passwords, as well as hundreds of certificates. A data inconsistency would knock out several hundred of our clients' websites.
Vault doesn’t work like Trocla: the main interest of the Trocla library is that it can manage a multitude of formats and dynamically generate secrets if they do not exist.
Content generated by Puppet interacts in production with the Trocla API and Ansible through a dedicated library. A migration to Vault would involve reviewing almost 1k Ansible playbooks.
Trocla is of significant importance in the Puppet code base: it is used in our profile modules (almost 100 modules) and in the hiera content (almost 15k references).
The content of the Trocla database is constantly changing, whether dynamically or via OPS interaction.
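
To illustrate the second point, here is a minimal sketch of the "generate if missing" behaviour, next to a plain lookup as a simple key/value store would do it. The GeneratingStore class, its method names and the fixed password length are hypothetical stand-ins written for this example. They are not the Trocla API, and a single random string stands in for Trocla's many formats.

```ruby
require 'securerandom'

# Hypothetical in-memory stand-in for a secret backend,
# laid out as { key => { format => value } }.
class GeneratingStore
  def initialize
    @data = Hash.new { |hash, key| hash[key] = {} }
  end

  # Return the stored secret, generating and persisting one if it is missing.
  def password(key, format, length: 16)
    @data[key][format] ||= SecureRandom.alphanumeric(length)
  end

  # Plain lookup with no side effect: returns nil for an unknown secret.
  def read(key, format)
    @data.fetch(key, {})[format]
  end
end

store = GeneratingStore.new
store.read('db_root', 'plain')      # nothing stored yet, so nil
store.password('db_root', 'plain')  # generated on first call, then reused
```

The important property is that the first caller creates the secret and every later caller gets the same value, which is what the Puppet code base relies on.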

Migration preparation

After listing all these points, the task turned out to be more complex than expected. From this was born a working group of 3 people, each with their own expertise. We studied several solutions extensively; one of them was to simply add Vault as a new Trocla backend. As we didn’t have a Ruby developer or a lot of expertise in the subject, it was not easy to accept poor custom code on such a central element. It was decided to do the work and bring it to the community in order to get a code review. After that, the project was split into 3 main points and we each took the lead on one of them.

The first person, with a senior tech lead profile, who had participated in the initial Trocla implementation, had already taken on the Vault subject. First, the architecture had to be set up. We started with a relatively simple Vault architecture: 2 instances load-balanced locally by HAProxy, with data stored in the backend on an internal Consul cluster. An eventual migration to Kubernetes is being considered. Auto-unseal is also in place, using another Vault instance hosted at a public cloud provider. Finally, performance and load tests were carried out to ensure that everything was ready to accommodate the Puppet traffic (which we had trouble evaluating, to be honest).
An OPS from the Run engineer pool, who wished to level up his skills on the subject, was in charge of setting up the data migration from MySQL to Vault. For that we needed to get the current content through Trocla (to handle the x509-encrypted part) and copy the data with the vault-ruby lib. The algorithm was as follows:
1. List all keys
2. For each key, list its formats
3. For each format, get the content and build a new data hash for Vault with formats + secrets
4. Push the key to a Vault kv v2
And finally me, in charge of writing the Ruby code of the new Trocla backend and its use with Puppet, the Trocla API and the Ansible interaction. I proposed two merge requests:
- #61 to add the Vault store integration, which was enthusiastically received by the maintainer duritong
- #68 to add the secret destroy (content + metadata)
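
The four-step migration algorithm listed above can be sketched roughly as follows. This is not our actual script: MemorySource and RecordingSink are hypothetical stand-ins so that the example runs without a database or a Vault server. In a real run the source would read from Trocla, and the sink would be a kv v2 client, which is assumed here to expose a write(path, data) method as the vault-ruby kv client does.

```ruby
# Hypothetical stand-in for the Trocla store being read.
class MemorySource
  def initialize(data)
    @data = data
  end

  def keys
    @data.keys
  end

  def formats(key)
    @data.fetch(key).keys
  end

  def get(key, format)
    @data.fetch(key).fetch(format)
  end
end

# Hypothetical stand-in for the kv v2 client: records writes instead of
# calling Vault.
class RecordingSink
  attr_reader :writes

  def initialize
    @writes = {}
  end

  def write(path, data)
    @writes[path] = data
  end
end

# Steps 2 and 3: build one hash per key, mapping each format to its secret.
def build_payload(source, key)
  source.formats(key).each_with_object({}) do |format, payload|
    payload[format] = source.get(key, format)
  end
end

# Steps 1 and 4: walk every key and push its payload to the sink.
# Returns the list of migrated keys.
def migrate(source, sink)
  source.keys.each_with_object([]) do |key, migrated|
    sink.write(key, build_payload(source, key))
    migrated << key
  end
end

source = MemorySource.new('db_root' => { 'plain' => 's3cret', 'mysql' => '*HASH' })
sink = RecordingSink.new
migrate(source, sink)
```

Grouping all formats of a key into a single write gives one Vault entry per Trocla key, so each secret's formats share the same kv v2 version history.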

As for the action plan, it was very simple, but required the interruption of operations for a day (we took a non-working day).

Set up an empty kv
Stop all Puppet agents so as not to modify production passwords during the operation
Run the data migration script
Test and validate with some nodes
Re-enable all Puppet agents and watch for any anomaly

Vault token management

This was one of the most complex points for all of us. All Vault access security is based on token management, and the ideal is to have tokens that are as short-lived and as rarely used as possible. But this is not very compatible with use through the DSL compiled by Puppetservers. Indeed, the configuration and initialization of trocla are done when the jruby instances are created, and it was unthinkable to have to re-create tokens at each call.

So, we chose to create a service token with the following restrictions:

Restriction to the CIDRs of our Puppetservers and API
A token policy on a dedicated kv
A 24h TTL, renewed indefinitely

With that, we needed to set up a token TTL renewal, which I put in place on the Puppetservers (which hold the service token in the trocla configuration).

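#!/opt/puppetlabs/puppet/bin/ruby
# Renew the Vault service token used by Trocla before its TTL expires.
require 'yaml'
require 'vault'

lock = '/tmp/trocla-token-renew.lock'

# Refuse to run while another renewal holds the lock.
raise format('Lock file %s exists', lock) if File.exist?(lock)
File.open(lock, 'w') {}

begin
  # Reuse the Vault settings from the Trocla configuration, minus :mount.
  troclarc = YAML.load_file('/etc/troclarc.yaml')
  vault = Vault::Client.new(
    troclarc.delete('store_options').reject { |k, _| k == :mount }
  )
  vault.auth_token.renew_self
ensure
  # Always release the lock, so a failed renewal does not block the next run.
  File.delete(lock) if File.exist?(lock)
end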

After that, I just needed to create a systemd timer with Puppet:

$trocla_renew = '/usr/local/sbin/trocla-token-renew'

file { 'trocla-token-renew':
  ensure => file,
  path   => $trocla_renew,
  owner  => 'root',
  group  => 'root',
  mode   => '0500',
  source => "puppet:///modules/${module_name}/trocla/trocla-token-renew",
}

-> systemd::unit_file { 'trocla-token-renew.service':
  ensure  => 'present',
  content => template("${module_name}/systemd/trocla-token-renew.service.erb"),
}
-> systemd::unit_file { 'trocla-token-renew.timer':
  ensure  => 'present',
  content => template("${module_name}/systemd/trocla-token-renew.timer.erb"),
  enable  => true,
  active  => true,
}

trocla-token-renew.timer.erb:

[Unit]
Description=Timer to run the vault token renew

[Timer]
OnBootSec=15min
OnCalendar=*-*-* <%= format('%02d', @hour) %>:<%= format('%02d', @minute) %>:00

[Install]
WantedBy=timers.target

trocla-token-renew.service.erb:

[Unit]
Description=Script to renew the trocla vault token
ConditionFileIsExecutable=<%= @trocla_renew %>
After=network.target

[Service]
Type=oneshot
ExecStart=<%= @trocla_renew %>
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=5

[Install]
WantedBy=multi-user.target

Migration

On the day of the migration everything went well at first. A few keys with specific cases had been identified: keys containing a special character like /, or keys with names that were too long, which we treated on a case-by-case basis. The import finished as it had in pre-production (in about 20 min), but unfortunately, when testing the content on a few nodes, we quickly saw that something was wrong: some secrets had changed. It didn’t take long to realize that during the migration we had imported the pre-production data set, from a backup that was several weeks old.

This setback had been anticipated in the action plan, but during the destruction of the kv we had a problem. The kv had too much content and failed to destroy itself correctly. It was after much searching through GitHub issues that we found the solution: we changed the value of the default_max_request_duration parameter so that Consul finally had time to delete all the contents of the kv.

The import was then redone correctly and we were able to finish the migration.

Feedback

After the migration we were able to extract more metrics from our Trocla infrastructure. We were surprised to observe 2M calls per day on the Vault kv (for 10M calls per day from the Puppet API), and that adding an extra HTTPS layer had no negative effect on the compilation of Puppet catalogs. On the contrary, Vault turned out to be a faster backend than MySQL (we are talking about a gain of a few ms).

I then added the Vault key expiration implementation in merge request #71 (with a small fix in #80), allowing us to renew secrets in a very simple way while keeping an accessible history through the Vault web UI.

Finally, unfortunately, we are still waiting for our cases of Trocla Nera wine from our (old) managers.