Q: Is an agent's policy automatically re-enabled upon agent restart (i.e. machine reboot)?


Aaron Young
 


Hello, I am new to Keylime and had a question about the behavior when an agent/machine is rebooted/restarted.

 I have performed the following test:

1. Installed Keylime such that the registrar/verifier are running on a system. On another system the agent is started.
2. I then used keylime_tenant to install a policy for the agent. The Keylime system then begins successfully monitoring the agent/system (returning code 200 every few seconds). All is good at this point.

3. The agent system is then rebooted
4. After the agent system comes back up and the agent restarts, I can see the agent successfully re-connect to the registrar, BUT the previous policy is NOT automatically re-installed for the agent, i.e. I have to run keylime_tenant again to get the policy re-installed.

Is this the proper behavior? I expected that the policy would automatically get re-installed upon agent restart.

Did I misconfigure something, perhaps?

Here are the messages I get from the registrar/verifier, where you can see it get a successful 200 response from monitoring, lose the connection to the agent due to the reboot, stop polling the agent, and then see the agent reconnect/re-register after the system reboots:

----------
keylime.registrar-common - INFO - GET returning 200 response for agent_id:D432FBB3-D2F1-4A97-9EF7-75BD81C00000
2020-04-06 15:13:58.997 - keylime.tpm2 - INFO - TPM2-TOOLS Version: 3.2.2
2020-04-06 15:15:21.730 - keylime.cloudverifier - INFO - connection to 10.149.224.212 refused after 1/10 tries, trying again in 1.000000 seconds
2020-04-06 15:15:22.863 - keylime.cloudverifier - INFO - connection to 10.149.224.212 refused after 2/10 tries, trying again in 1.000000 seconds
2020-04-06 15:15:23.999 - keylime.cloudverifier - INFO - connection to 10.149.224.212 refused after 3/10 tries, trying again in 1.000000 seconds
2020-04-06 15:15:45.021 - keylime.cloudverifier - CRITICAL - Unexpected Get Quote response error for cloud agent D432FBB3-D2F1-4A97-9EF7-75BD81C00000, Error: 500
2020-04-06 15:15:45.033 - keylime.cloudverifier - WARNING - agent D432FBB3-D2F1-4A97-9EF7-75BD81C00000 failed, stopping polling
2020-04-06 15:17:51.572 - keylime.tpm2 - INFO - Encrypting AIK for UUID D432FBB3-D2F1-4A97-9EF7-75BD81C00000
2020-04-06 15:17:51.578 - keylime.registrar-common - INFO - Overwriting previous registration for this UUID.
2020-04-06 15:17:51.598 - keylime.registrar-common - INFO - POST returning key blob for agent_id: D432FBB3-D2F1-4A97-9EF7-75BD81C00000
2020-04-06 15:17:52.759 - keylime.registrar-common - INFO - PUT activated: D432FBB3-D2F1-4A97-9EF7-75BD81C00000

-------

Thanks for any help.

-Aaron Young


Luke A Hinds
 

Hi Aaron,

It's more likely that the verifier would have timed out the agent. However, let me say up front: this is a very valid case you have outlined, and so we could look at amending this behavior if it makes sense.

There are two junctures of registration within Keylime. The first is between the Agent and the Registrar. This sets up a UUID entry and stores the EKpub/AKpub within the registrar. The machine has not been attested yet (we don't trust it), but it is now positioned to be "verified" by the Verifier.

The second part, as said, is where we establish the trust state of the machine, and this is between the Agent and the Verifier (with the tenant involved initially to provide an extra verification point in case someone has tampered with our Verifier instance). The verifier will request a TPM quote from the agent and then ensure a root of trust is present by querying the registrar, which holds the AKpub (ensuring the quote has not been tampered with). Typically what you will see is the verifier polling with "Checking IMA measurement list on agent: <UUID>" and the Agent replying to the request with an integrity quote. If the agent is killed at this point, the verifier will retry 10 times before eventually failing.
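To make that concrete, here is a minimal sketch of the polling/retry behaviour described above (illustrative only: the endpoint path, timings and helper names are assumptions made for the sketch, not the actual verifier code):

import time
import requests  # generic HTTP client, used only for this sketch

def poll_agent(agent_url, retry_interval=1.0, max_retries=10):
    # Poll the agent for integrity quotes; stop after max_retries failed attempts.
    retries = 0
    while True:
        try:
            # hypothetical quote endpoint path, named for illustration only
            r = requests.get(agent_url + "/quotes/integrity", timeout=5)
            if r.status_code == 200:
                retries = 0           # good quote: reset the counter, keep polling
                time.sleep(2)
                continue
        except requests.exceptions.ConnectionError:
            pass                      # agent unreachable: count this as a failed attempt
        retries += 1
        if retries >= max_retries:
            return "FAILED"           # give up and stop polling this agent
        time.sleep(retry_interval)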

There are also the following values you can toggle in /etc/keylime.conf on the verifier:


# How long to wait between failed attempts to connect to a cloud agent in
# seconds.  floating point values accepted here
retry_interval = 1

# Integer number of retries to connect to an agent before giving up
max_retries = 10

You could try setting max_retries to a much higher number and perhaps bump retry_interval up to, say, 10.
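For example (illustrative values only, in the [cloud_verifier] section of /etc/keylime.conf):

# How long to wait between failed attempts to connect to a cloud agent, in seconds
retry_interval = 10

# Number of retries before giving up -- roughly an hour of retries at a 10 second interval
max_retries = 360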

This might well meet your needs. The verifier-to-agent HTTP requests are asynchronous/non-blocking, so you don't need to worry about bottlenecks (within reason, of course :)

Let us know how you get on; you can also jump on Gitter if you need some help.

We can also recommend other use cases, for example delivery of an encrypted payload once a machine proves its trust.

Cheers,

Luke





--
Luke Hinds: Security Strategy  | Office of the CTO | Red Hat
e: lhinds@... | t: +44 12 52 36 2483


Aaron Young
 


Thanks for the response. Do you think a possible easy fix (or workaround) for this issue would be to allow an infinite retry value for:

# Integer number of retries to connect to an agent before giving up
max_retries = 10

i.e. maybe allow setting it to -1 to specify infinite?

-Aaron


Luke A Hinds
 




I figure that should work; give it a try and let us know if it operates as expected.




Aaron Young
 

> I figure that should work; give it a try and let us know if it operates as expected.
 
I tried the following code change (below) real quick, but unfortunately it didn't work.

With this code change, the agent no longer timed out, but after the agent re-registered following the reboot, tpm2_checkquote errors were reported (and the agent went into a failed state).

NOTE: the additional check for response.status_code == 500 is there because 500 is the status I got back when rebooting the agent machine.

 diff --git a/keylime/cloud_verifier_tornado.py b/keylime/cloud_verifier_tornado.py
index ac2b9e7..4c680ed 100644
--- a/keylime/cloud_verifier_tornado.py
+++ b/keylime/cloud_verifier_tornado.py
@@ -307,7 +307,7 @@ class AgentsHandler(BaseHandler):
 
         if response.status_code !=200:
             # this is a connection error, retry get quote
-            if response.status_code == 599:
+            if response.status_code == 599 or response.status_code == 500:
                 asyncio.ensure_future(self.process_agent(agent, cloud_verifier_common.CloudAgent_Operational_Sta
             else:
                 #catastrophic error, do not continue
@@ -426,7 +426,7 @@ class AgentsHandler(BaseHandler):
             retry = config.getfloat('cloud_verifier','retry_interval')
             if main_agent_operational_state == cloud_verifier_common.CloudAgent_Operational_State.GET_QUOTE and
                 new_operational_state == cloud_verifier_common.CloudAgent_Operational_State.GET_QUOTE_RETRY:
-                if agent['num_retries']>=maxr:
+                if maxr!=-1 and agent['num_retries']>=maxr:
                     logger.warning("agent %s was not reachable for quote in %d tries, setting state to FAILED"%(
                     if agent['first_verified']: # only notify on previously good agents
                         cloud_verifier_common.notifyError(agent,'comm_error')


 I didn't look into it any further...

 -Aaron


Luke A Hinds
 




Were you measuring anything at all with `tpm_policy`?
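(For anyone following along: a tpm_policy is roughly a JSON map of PCR numbers to the golden value(s) the quote must match; the values below are placeholders, not real measurements.)

{"15": "0000000000000000000000000000000000000000",
 "22": ["ffffffffffffffffffffffffffffffffffffffff"]}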

I will try to find time to look at this; in the meantime, would you mind raising an issue on GitHub? That way we can either produce a fix or consider a change that might be needed.

@Munson, Charles - 0553 - MITLL do you have any ideas, and have you tried this scenario out before?



Munson, Charles - 0553 - MITLL <Charles.Munson@...>
 

Can you tell if there are any errors while attempting to store the tenant’s key half in the agent TPM’s NVRAM? 

 

Notably, this occurs here: https://github.com/keylime/keylime/blob/master/keylime/tpm2.py#L1174

 

The tenant’s key half should be stored at NV index 0x1500018 (maybe you can check the TPM to see whether that is true/successful, both before and after the reboot?)
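(For instance, with tpm2-tools 3.x as shown in the log above, something like the following before and after the reboot; on tpm2-tools 4.x+ the listing command is tpm2_nvreadpublic instead:)

# list the defined NV indices and look for the Keylime index mentioned above
tpm2_nvlist | grep -i 1500018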

 

Best,

Charlie

 

 



Luke A Hinds
 

Hi Charlie,

I guess this would only be the case for a HW TPM; could we test this with the emulator? I figure it should be fine as long as NVChip is not deleted and the tpm_server is not started with the clear-state flag?

Luke



Kenneth Goldman
 

It depends on the emulator and version.

I recall that the original Microsoft implementation 'remanufactured' the TPM when the process started. It may be different now. The IBM implementation retains the TPM NV state in NVChip unless the -rm (remove, remanufacture) option is used.

Of course, like the HW TPM, powerup and startup affect the state.
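(So for the reboot test against the IBM simulator, roughly: start tpm_server from the directory that holds NVChip and simply avoid -rm:)

./tpm_server        # reuses the existing NVChip, so NV state survives the restart
./tpm_server -rm    # remanufactures the TPM and discards the stored NV state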

--
Ken Goldman kgoldman@...
914-945-2415 (862-2415)


"Luke A Hinds" ---04/18/2020 02:29:50 PM---Hi Charlie, I guess this would only be the case for a HW tpm, could we test this with

From: "Luke A Hinds" <lhinds@...>
To: "Munson, Charles - 0553 - MITLL" <Charles.Munson@...>
Cc: "main@keylime.groups.io" <main@keylime.groups.io>
Date: 04/18/2020 02:29 PM
Subject: [EXTERNAL] Re: [keylime] Q: Is an agent's policy automatically re-enabled upon agent restart (i.e. machine reboot)?
Sent by: main@keylime.groups.io





Hi Charlie,

I guess this would only be the case for a HW tpm, could we test this with the emulator. I figure as long as NVChip is not delete and the tpm_server is not started with the clear state flag?

_,_._,_



Luke A Hinds
 

Thanks Ken, that's what I thought, and we are using the IBM TPM (1119).

I will try to test this and we can plan out a means for the agent to recommence verification once it's back online again.




Luke A Hinds
 

I have created an enhancement proposal for this work:


I am also (as you will have noticed) introducing a new means of proposing significant changes in Keylime. This is loosely based on Kubernetes Enhancement Proposals (KEPs). It means we can review, track and gather enhancement proposals in one single location, so this will replace the old system of using Google Docs (which is not fair on those who do not wish to use Google-based tools).

If you have an existing enhancement proposal still in flight, please port it over to the new format.

If you see any improvements that could be made to this new enhancement proposal, you are of course welcome to make a pull request.

More details are on the README:


It's a Sunday morning effort, so go easy on the mistakes :)
