whoami, who am I? Thoughts on protecting digital and human identities.

TL;DR: What is identity security, why we often do it wrong, and how we can get it right.

I started digging into identity and identity security concepts earlier this year in order to help my employer integrate more identity-based security controls and telemetry into its products. I’ve dabbled in identity for years, but I’d never formally studied it. Naturally, I started reading whitepapers, blogs, and websites, and learned a ton in the process. However, I also came away with the sense that we’re collectively making a lot of the same mistakes about identity.

In this article, I’m going to explore some existing definitions of identity, attempt to land on my own definition, and discuss where things can go right or wrong.

So, what the heck’s an identity?

How do we answer this question? Let’s start with the prior art. What does existing literature say an identity is? Here’s a pair of examples:

In their 2010 book Identity Management: Concepts, Technologies, and Systems¹, Elisa Bertino and Kenji Takahashi defined identity as information about an entity that is sufficient to identify that entity in a particular context. Dr. Omondi Orondo, on the other hand, defines it in Identity & Access Management: A Systems Engineering Approach as a system representation (or abstraction) of a human being acting on the IAM system².

What happens when we look outside of the technology industry? Other disciplines study identity too! Dr. James Fearon published a draft paper in 1999 called What is identity (as we now use the word)?³, which summarizes many definitions of identity from across many publications. For example, Hogg and Abrams defined identity in 1988 as “people’s concepts of who they are, of what sort of people they are, and how they relate to others.” That won’t work for us infosec folks, because we need to incorporate machine identities that don’t (yet?) have a sense of self. In Dr. Fearon’s own words, “the range, complexity, and differences among these various formulations are remarkable.”

Here’s the thing: these definitions leave a lot to be desired. In the context of their respective textbooks or publications, they make sense and work, but they don’t generalize very well. Ultimately, they raised some questions for me, and with that in mind, I’d like to propose my own definition that answers those questions. The questions I have are:

Can we define identity without using the words identity or identify?
How do we account for non-human entities and their identities?
Is an identity really a one-to-one mapping, as these definitions imply?

The first question refers to tautological definitions. We’re defining identity with identity, or with identity and access management (IAM). We need a more fundamental definition.

The second question relates to the plethora of purely non-human identities we commonly deal with. How do we account for those? Lots of definitions simply don’t.

And finally, the third question relates to a common assumption among many authors, companies, and technologies: one entity has one identity, and you can affirmatively identify the entity with that… identity. This doesn’t hold up in practice.

We’ll do it live

Let’s make our own definition! How should we do that? Well, to start, we’ll actually need to define two terms. Keep in mind that my perspective is colored by the lens of defensive security research, so these definitions may not be applicable to all situations. Still, they’re what I try to keep in mind when I’m doing identity work.

Participant: any entity capable of acting upon or being acted upon by any other entity.
Identity: any set of information that fully describes a participant or set of participants in a particular context.

Those two definitions imply the following:

Participants are not all humans.
Participants may be entirely passive.
Participant-to-Identity can be a one-to-one, one-to-many, many-to-one, or many-to-many relationship, or anything in between.

Participants do stuff to other participants. This doing of stuff exposes information that can be used to identify participants and construct an identity for them.
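To make the relationship between the two terms concrete, here is a minimal sketch of how they might be modeled in Python. The class names follow the definitions above, but every field (kind, context, attributes) and every sample value is an assumption made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Participant:
    """Any entity that can act or be acted upon (human, host, service account)."""
    name: str
    kind: str  # hypothetical label such as "human" or "server"


@dataclass
class Identity:
    """A set of information describing one or more participants in a context."""
    context: str
    attributes: dict[str, str]
    participants: set[Participant] = field(default_factory=set)


alice = Participant("alice", "human")
build_bot = Participant("build-bot", "service_account")
file_server = Participant("fs01", "server")  # may be entirely passive

# One participant, many identities: alice looks different in each context.
alice_vpn = Identity("vpn", {"username": "alice", "device": "laptop-17"}, {alice})
alice_git = Identity("source-control", {"handle": "alice-dev"}, {alice})

# Many participants, one identity: a shared deploy credential.
deployers = Identity("ci", {"api_key_id": "deploy-key"}, {alice, build_bot})

all_identities = [alice_vpn, alice_git, deployers]


def identities_for(participant, identities):
    """Return every identity that describes the given participant."""
    return [i for i in identities if participant in i.participants]


print([i.context for i in identities_for(alice, all_identities)])
# prints ['vpn', 'source-control', 'ci']
```

Note that nothing here forces a one-to-one mapping: the same participant shows up in three identities, and one identity covers two participants.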

How should we digitize identity?

Again, this is from a defensive security researcher’s perspective. When we talk about digitizing identity, we have to work with what we have, and what we have is security telemetry. Fortunately, security telemetry gives us a lot of what we need to get started.

We can collect the stuff participants do and the information exposed during these actions. The information exposed can be thought of as a passive identity or identifier, and the stuff participants do can be thought of as an active identity or identifier. With one or both—and inside of a specific context—we can use the information and actions exposed through security telemetry to fully describe participants or sets of participants. Whatever form this description takes? That’s your digitized identity.
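As a rough sketch of that split, the snippet below folds a handful of telemetry events into a passive and an active identifier set per participant. The event fields (actor, action, target, src_ip, user_agent) and the choice of which fields count as passive are assumptions for illustration; real telemetry schemas vary by product.

```python
from collections import defaultdict

# Hypothetical, simplified telemetry events.
events = [
    {"actor": "alice", "action": "logon", "target": "fs01",
     "src_ip": "10.0.0.5", "user_agent": "ssh"},
    {"actor": "alice", "action": "read", "target": "fs01",
     "src_ip": "10.0.0.5", "user_agent": "ssh"},
    {"actor": "build-bot", "action": "logon", "target": "ci01",
     "src_ip": "10.0.1.9", "user_agent": "api"},
]

# Assumption: these fields are information exposed while acting (passive).
PASSIVE_FIELDS = ("src_ip", "user_agent")


def digitize(events):
    """Build a per-actor description from exposed info and observed actions."""
    identities = defaultdict(lambda: {"passive": set(), "active": set()})
    for event in events:
        identity = identities[event["actor"]]
        for name in PASSIVE_FIELDS:
            identity["passive"].add((name, event[name]))
        identity["active"].add((event["action"], event["target"]))
    return dict(identities)


digitized = digitize(events)
for actor, identity in digitized.items():
    print(actor, sorted(identity["passive"]), sorted(identity["active"]))
```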

What can we do with a digitized identity?

Now that we have our digitized identities, what can we actually do with them? For this to be a worthwhile endeavor, that digitized identity artifact needs to be useful. We can lean on the active-passive distinction to guide us here.

Active identities are for alerting (something you do)

If you want to get alerted when an identity does something it’s not supposed to do, active identities are your friend. An active identity describes a participant by what it does. If you know what a participant is supposed to do (or not supposed to do), you can build and check active identities. You can think of an active identity as “User A frequently logs in to Server B.” Behaviors that break from this pattern might trigger security alerts.
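Here is one way that check might look, as a minimal sketch. The expected and forbidden baselines are hypothetical; in practice they would be learned from telemetry or written down by whoever owns the system.

```python
# Hypothetical baselines keyed by actor, each a set of (action, target) pairs.
expected = {"alice": {("logon", "fs01"), ("read", "fs01")}}
forbidden = {"alice": {("logon", "dc01")}}


def check_event(event, expected, forbidden):
    """Compare one observed behavior against what the actor should (not) do."""
    actor = event["actor"]
    behavior = (event["action"], event["target"])
    if behavior in forbidden.get(actor, set()):
        return "alert: forbidden behavior"
    if behavior not in expected.get(actor, set()):
        return "alert: unexpected behavior"
    return "ok"


print(check_event({"actor": "alice", "action": "logon", "target": "fs01"},
                  expected, forbidden))  # prints ok
print(check_event({"actor": "alice", "action": "logon", "target": "dc01"},
                  expected, forbidden))  # prints alert: forbidden behavior
print(check_event({"actor": "alice", "action": "logon", "target": "hr02"},
                  expected, forbidden))  # prints alert: unexpected behavior
```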

Passive identities are for baselining and discovery (something you are)

If you want to get more nebulous identity insights into a particular environment, passive identities will have your answers. You can think of a passive identity as “User A and User B typically work on the same projects.” I find it easier to explain this with questions, so here are a few questions you might answer with passive identities.

What does my environment’s identity network look like?
Is this participant who we expect them to be?
What other participants does this participant routinely interact with?

Passive identities let you build fun things like identity network graphs and identity profiles, which you can combine with active identities to start asking and answering these questions and more.
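A bare-bones identity network graph can be sketched with nothing but counters. The interaction pairs below are made up, and the min_count threshold for "routinely" is an arbitrary assumption you would tune for your own environment.

```python
from collections import Counter, defaultdict

# Hypothetical observed interactions between participants.
interactions = [
    ("alice", "bob"), ("alice", "bob"), ("alice", "bob"),
    ("alice", "fs01"), ("alice", "fs01"),
    ("alice", "carol"),
    ("bob", "fs01"),
]


def build_graph(interactions):
    """Undirected graph where the edge weight is the interaction count."""
    graph = defaultdict(Counter)
    for a, b in interactions:
        graph[a][b] += 1
        graph[b][a] += 1
    return graph


def routine_contacts(graph, participant, min_count=2):
    """Participants this one has interacted with at least min_count times."""
    neighbors = graph.get(participant, Counter())
    return sorted(p for p, n in neighbors.items() if n >= min_count)


graph = build_graph(interactions)
print(routine_contacts(graph, "alice"))  # prints ['bob', 'fs01']
```

This answers the third question above directly, and the same graph is a starting point for the other two.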

Welcome to the [Identity] Machine

This is where machine learning (ML) comes into play. We can ask and answer a lot of good questions manually, or with heuristics, but at a certain scale, that becomes intractable. From my (again, biased) perspective, there are three main ways we can leverage machine learning to understand identities.

We can enrich identifying information to build meta-identities. For example, we can cluster and build identities for virtual teams within an organization that don’t follow the org chart, giving you ground truth on how people are working together.
We can contextualize information with relevant identifiers, whatever they may be. For example, if you’re reviewing a suspicious firewall alert (which, let’s be honest, is almost definitely a false positive), it might be helpful to know who is making that connection and if this falls in line with their expected behavior. Think of how many alerts you can filter out just by enriching them with identity information.
We can automate the asking and answering of difficult questions. With meta-identities and identity-enriched context, we’re in a good spot to automate a ton of identity-related questions and tasks. What that automation looks like depends on the identity problems you’re trying to solve, but I’m sure if you made it this far you can come up with something.
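To show the shape of the first two ideas without pulling in an ML library, the sketch below stands in for clustering with plain connected components over strongly weighted edges, then uses the result to enrich an alert. The edge weights, the min_weight threshold, and the alert fields are all assumptions; a real pipeline would likely use a proper clustering algorithm.

```python
# Hypothetical interaction counts between pairs of users.
edges = {
    ("alice", "bob"): 5,
    ("bob", "carol"): 4,
    ("dave", "erin"): 6,
    ("alice", "dave"): 1,  # too weak to link the two groups
}


def virtual_teams(edges, min_weight=3):
    """Group users linked by edges at or above min_weight (connected components)."""
    neighbors = {}
    for (a, b), weight in edges.items():
        if weight >= min_weight:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    teams, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        team, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in team:
                team.add(node)
                stack.extend(neighbors[node] - team)
        seen |= team
        teams.append(frozenset(team))
    return teams


def enrich(alert, teams):
    """Attach the virtual team (meta-identity) of the alert's user, if any."""
    team = next((t for t in teams if alert["user"] in t), None)
    return {**alert, "virtual_team": sorted(team) if team else None}


teams = virtual_teams(edges)
enriched = enrich({"user": "carol", "dest_ip": "203.0.113.7"}, teams)
print(enriched["virtual_team"])  # prints ['alice', 'bob', 'carol']
```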

Show me

I’d like to close by putting it all together with a concrete example. We’ll ask a deceptively simple question: when someone logs into a server, are they doing it from a place we expect? If you’ve worked with IP reputation, IP geolocation, and improbable travel modeling, you know trying to answer this manually or with heuristics can be tricky.

Well, using security telemetry alone, we can answer this question. The first step is to gather up all our active and passive identifiers and build out an identity network. This will include employees, but also workstations, servers, offices, service accounts, and so on. The different attributes or pieces of information in these active and passive identities will be the features for our machine learning pipeline. With these features, we can now model and solve this question as a machine learning problem.

Now, if you’ve worked with ML on these kinds of problems before, you’ll know that this alone will generate a ton of false positives. Anomaly detection algorithms detect anomalies, not evil. Anomaly detection doesn’t know what your goals are; it just knows what’s normal and what isn’t. This is where our meta-identities from the previous section can really help us out. By intelligently constructing our ML architecture and incorporating those enriched features, we can alert, for example, when an unexpected virtual team accesses a server, instead of an unexpected individual human. Alternatively, we can automatically cluster a group of servers that leverage service accounts to perform periodic tasks, even without an explicit organizational definition. This significantly cuts down on the noise generated by the ML model, and gives us a good artifact to automate.
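The difference between alerting on individuals and alerting on virtual teams can be sketched with a simple lookup, no model required. The logon history and the team_of mapping are hypothetical; in the pipeline described above, team membership would come from the enriched meta-identities.

```python
# Hypothetical history of (user, server) logons and user-to-team mapping.
history = {("alice", "db01"), ("bob", "db01"), ("dave", "web01")}
team_of = {"alice": "data", "bob": "data", "carol": "data", "dave": "web"}


def unexpected_for_individual(history, user, server):
    """True if this exact user has never logged on to this server."""
    return (user, server) not in history


def unexpected_for_team(history, team_of, user, server):
    """True if nobody on the user's virtual team has logged on to this server."""
    team = team_of.get(user)
    teams_seen = {team_of.get(u) for u, s in history if s == server}
    return team is None or team not in teams_seen


# carol is new to db01, but her teammates use it all the time.
print(unexpected_for_individual(history, "carol", "db01"))  # prints True
print(unexpected_for_team(history, team_of, "carol", "db01"))  # prints False

# dave's team has never touched db01, so this one is worth a look.
print(unexpected_for_team(history, team_of, "dave", "db01"))  # prints True
```

The individual check fires on carol's first logon; the team check stays quiet, which is the kind of noise reduction we're after.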

With the automated ML pipeline done, we can expose this model as a question for humans to pull answers from during an investigation. When an analyst starts triaging this suspicious logon firewall alert (that triggered on some out-of-date, inaccurate IP geolocation list), they can run it through the model. In addition to getting a classification out of the model (is this expected or not?), they can also dig into the meta-identities and other enriched information used.⁴

After a few iterations, the human analyst will probably have a good idea of what heuristics they need and what manual investigation they perform when they get these alerts, and the whole process (or at least a meaningful subset of the process) can be automated. We can extend this same model and/or process to other questions as well. Essentially, we are looking for various statistical outliers in different clusters. For example, based on their team, should a user be running a certain binary? Should they be reading a certain document? Do they typically send more external or internal emails? What do their inbox rules usually look like? Which teams are using what SaaS applications? Each of these might require independent modeling and analyst review.
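As a small illustration of looking for statistical outliers inside a cluster, the sketch below flags team members whose share of external email sits far from the rest of the team, using a plain z-score. The numbers and the threshold are invented, and a z-score is just one simple choice of outlier test.

```python
from statistics import mean, pstdev

# Hypothetical fraction of each user's email sent externally, for one team.
external_email_ratio = {
    "alice": 0.10, "bob": 0.12, "carol": 0.11,
    "dave": 0.09, "erin": 0.10, "frank": 0.90,
}


def outliers_in_cluster(values, threshold=2.0):
    """Return members whose value is more than `threshold` std devs from the mean."""
    data = list(values.values())
    mu = mean(data)
    sigma = pstdev(data)
    if sigma == 0:
        return []  # everyone behaves identically, so nothing stands out
    return sorted(u for u, v in values.items() if abs(v - mu) / sigma > threshold)


print(outliers_in_cluster(external_email_ratio))  # prints ['frank']
```

The same function works for any per-user number you can pull from telemetry, as long as you compare users within the same cluster.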

You write too much. Summarize it for me?

No problem, lazy straw person! Everyone has their own definition of identity, based on their use case, so we’re not doing anything too radical by reframing identity to better help us solve security problems. From a security perspective, identities are highly contextual relationships between participants characterized by what they do to each other. Embracing identities in security telemetry by describing them as they are, as opposed to prescribing them as we believe they should be, gives us a lot of flexibility and power. We can take advantage of that flexibility and power to find evil and protect users and customers.

¹ This is a more traditional IAM book and a fairly easy read. Details on Google Books.
² It’s an interesting perspective that is worth checking out. You can find details about the book here.
³ It’s apparently been in a draft since 1999, but you can read it here.
⁴ If you’re interested in this concept in general, the keyword to search for is “AI explainability.” Great projects like AIX360 need more contributors.