Datadog has a security footgun
Bad API design leads to field day for domain-snatching data siphons
I’ve just finished dealing with an incident at Corporate Clash (a fan reimagination of the defunct Disney MMO, Toontown Online) where a misconfigured Datadog RUM (Real User Monitoring, used for tracking user sessions) config led to customer data being sent to an unauthorized third party for a week whenever a user visited our website at https://corporateclash.net.
The data, sent by the @datadog/browser-rum package, contained information such as:
your device make/model
location based on your IP address
OS/Browser versions
approximate internet speed
a replay of how users interacted with the website (clicks, scrolls, etc)
The set of data is comparable to what a site would learn about you if you took a wrong turn on the internet and visited it by mistake, so the degree of harm is minor. But since this data transmission happened automatically when our users visited our site, we decided to disclose it as a security incident.[1] Fortunately, we don’t think the domain parking service that picked up the domain was aware of the data, as they likely only went after the NXDOMAINs and weren’t interested in the extra Datadog data being transmitted.
I can’t help but feel Datadog is negligent in their API design that made the incident possible. Since I emailed them and they don’t consider this to be a security vulnerability, I will talk about how it works and how you can take advantage of this by being a savvy domain-snatcher, or rather, avoid this by reading Datadog’s documentation very, very, very carefully.
So, what’s a Site anyway?
If I ask you what a “site” is, what would you answer me with? A website? A physical location? If you were given this piece of code with no other context, what do you think a “site” means here?
datadogRum.init({
  applicationId: 'APP_ID',
  clientToken: 'CLIENT_ID',
  site: 'corporateclash.net',
  version: 'v1.0.0',
  env: 'production',
  service: 'website',
});
A normal person would probably assume a site is a website, perhaps a domain. If you ask Datadog though:
Datadog offers different sites throughout the world. Each site is completely independent, and you cannot share data across sites. Each site gives you benefits (for example, government security regulations) or allows you to store your data in specific locations around the world.
So… what literally everybody else calls a “region”. And this became the start of the footgun.
If you read the Datadog RUM documentation, site looks like a required parameter we have to fill in. It’s actually not! The package defaults to the US1 site if you don’t have one filled in, which we relied on as default since the introduction of the package, as we also happen to be on the US1 site. One of our developers, when upgrading the Datadog RUM package, noticed the missing site parameter. They made the mistake of thinking it meant “our website domain”, which I hope you agree is a very easy mistake to make, and put in our website domain, corporateclash.net, instead of the correct value, datadoghq.com.
Datadog uses the site parameter to figure out the browser intake domain, the domain to send the RUM data to. Here’s the full function source code[2], but let’s break it down here.
It first reads from the configuration what site should be; if site is empty, assign it to INTAKE_SITE_US1, a constant equalling datadoghq.com. This is the behaviour we were inadvertently relying on and why RUM wasn’t broken before we made the changes.
function buildEndpointHost(trackType: TrackType, initConfiguration: InitConfiguration & { usePciIntake?: boolean }) {
const { site = INTAKE_SITE_US1, internalAnalyticsSubdomain } = initConfiguration
If we’re tracking logs and we require PCI Compliance and we are on site US1, use PCI_INTAKE_HOST_US1, equalling pci.browser-intake-datadoghq.com. (Spotted anything fishy yet?)
if (trackType === 'logs' && initConfiguration.usePciIntake && site === INTAKE_SITE_US1) {
return PCI_INTAKE_HOST_US1
}
If we’re tracking internal analytics (for Datadog internal use?), send the data to a subdomain of INTAKE_SITE_US1.
if (internalAnalyticsSubdomain && site === INTAKE_SITE_US1) {
return `${internalAnalyticsSubdomain}.${INTAKE_SITE_US1}`
}
If we’re on INTAKE_SITE_FED_STAGING (Government staging environment), send the data to a subdomain of INTAKE_SITE_FED_STAGING.
if (site === INTAKE_SITE_FED_STAGING) {
return `http-intake.logs.${site}`
}
Finally, if nothing has matched, let’s build the domain by splitting the site on its dots, joining all but the last part with hyphens, putting the domain extension at the end, then prepending browser-intake- to the domain name…?
const domainParts = site.split('.')
const extension = domainParts.pop()
return `browser-intake-${domainParts.join('-')}.${extension!}`
Have you spotted the problem yet? Remember, the site at this point is corporateclash.net. It would not have hit any of the above branches and would fall straight through to here, and there is no sanity check on site values that don’t belong to Datadog. So the code path happily builds the browser intake domain, which is…
browser-intake-corporateclash.net.
A real domain, a domain we did not expect, and a domain we did not own.
A third party.
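To see the fall-through concretely, here is a minimal standalone sketch of just that last branch, reassembled from the snippets above. The function name buildFallbackIntakeHost is made up for illustration, the other branches are left out, and us5.datadoghq.com is my assumption of what the US5 site value looks like; none of this is Datadog’s actual module.

```typescript
// Constant value as described in the walkthrough above.
const INTAKE_SITE_US1 = 'datadoghq.com'

// Hypothetical stand-in for the final branch of buildEndpointHost.
// The PCI, internal analytics and staging branches are omitted because
// they only match Datadog's own site values.
function buildFallbackIntakeHost(site: string = INTAKE_SITE_US1): string {
  const domainParts = site.split('.')
  // pop() is typed as possibly undefined, so fall back to an empty string
  const extension = domainParts.pop() ?? ''
  return `browser-intake-${domainParts.join('-')}.${extension}`
}

// No site configured: falls back to the US1 default
console.log(buildFallbackIntakeHost()) // browser-intake-datadoghq.com

// Assumed US5 site value: still lands on a Datadog-looking host
console.log(buildFallbackIntakeHost('us5.datadoghq.com')) // browser-intake-us5-datadoghq.com

// Our mistake: lands on a domain nobody on our team owned
console.log(buildFallbackIntakeHost('corporateclash.net')) // browser-intake-corporateclash.net
```

Nothing in this path asks whether the resulting host has anything to do with Datadog; any string with a dot in it produces a plausible-looking hostname.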
The change was put in place on 24th December, 2023, and since nobody expected the domain to exist, the requests silently failed for a few months before a domain parking company noticed the NXDOMAINs and registered the domain on 29th March, 2024. We were very lucky that somebody[3] alerted us to the issue on 4th April, 2024, about a week after the domain’s registration. Had it not been spotted, we likely would not have noticed it and malicious actors might have had a chance to siphon our traffic data.[4]
We rectified the config error immediately, but 12 hours later, we made the decision to remove Datadog RUM from our website altogether. I don’t think I’m alone in the site confusion either - I’m terrible at GitHub search but I found at least 1 example where another person thought DD_SITE meant something other than how Datadog defines sites.
How we could have prevented the incident on our part
This is where I admit we had failures on our part that could have prevented the incident. Namely, monitoring and CSP.
We should have monitored the rollout of the package after the change was pushed, both by checking the Network tab in the browser and by checking our RUM data, which would have gone blank. The problem would have been caught straight away, and given that it took months for domain scrapers to notice, it would not have blown up like this.
We should have implemented CSP. It would have prevented the footgun from manifesting in browsers because browser-intake-corporateclash.net would not have been in the CSP allowlist. We have since implemented it on our website, but due to the age of the codebase, our CSP doesn’t yet offer the full protection a CSP can give users. We were already in the process of replacing the current website and work will accelerate.
Oh, and we should have read the docs.
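For the CSP point, this is roughly the shape of policy that would have stopped the browser from talking to the wrong host. It is a minimal sketch, not our production policy: buildCspHeader is a hypothetical helper, and I am assuming the US1 intake host browser-intake-datadoghq.com is the only external origin the page needs to connect to. A real policy needs more directives (scripts, styles, images and so on).

```typescript
// Hypothetical helper: builds a Content-Security-Policy header value that
// only lets the page open connections to itself and to the listed origins.
function buildCspHeader(connectSources: readonly string[]): string {
  const directives: Array<[string, readonly string[]]> = [
    ['default-src', ["'self'"]],
    // connect-src governs fetch, XMLHttpRequest and sendBeacon destinations
    ['connect-src', ["'self'", ...connectSources]],
  ]
  return directives
    .map(([name, values]) => `${name} ${values.join(' ')}`)
    .join('; ')
}

// Assumption: the US1 browser intake host is the only third-party origin needed
const cspHeader = buildCspHeader(['https://browser-intake-datadoghq.com'])
console.log(cspHeader)
// default-src 'self'; connect-src 'self' https://browser-intake-datadoghq.com
```

With a header like this served on every page, a request to browser-intake-corporateclash.net would be refused by the browser and show up as a CSP violation in the console, instead of quietly leaving the building.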
How I would have prevented the incident on Datadog’s part
If I were designing this API, I would have done the following to make sure this cannot ever happen:
I would not call regions “sites”. Everybody in the industry understands the term “regions” to mean geographically-segregated instances of the service.
If we have to keep calling them “sites”, change the parameter from site to ddSite, so it’s harder to confuse for the current site name.
If we have to keep the site parameter name (just to keep the existing instances secure), I would make sure you have to go out of your way to set site to an arbitrary domain. There are a few ways to do this:
    Only accept site codes (yes, they have those, such as US1, US5, and AP1) in the site parameter. Throw an exception and refuse to send data if incorrect. This will break BC.
    Only accept known-good site values, which are also very handily defined as constants. Throw an exception and refuse to send data if incorrect. This won’t be a BC break except for when rolling out new regions, in which case I don’t think it’s a stretch to tell people to upgrade their package.
    For internal development use (or if somebody has created a Datadog API-compatible service?), create a new parameter called ddSiteOverrideDomain which conveys several things: It is a Datadog-related parameter, it relates to Datadog sites, it is an override to the normal behaviour (danger alert!), and it expects a well-formed domain name.
Change the browser intake domain behaviour to always generate a subdomain of the site, instead of allowing it to generate arbitrary, unexpected domains. If the browser intake domain prefix had been browser-intake. (dot) instead of browser-intake- (hyphen), any arbitrary site value would result in a subdomain of the site (i.e. browser-intake.corporateclash.net), not some random domain the programmer likely has no control over (i.e. browser-intake-corporateclash.net). You are already doing it for internal analytics and government staging; why not here?
As a side note, I don’t think this setup is fooling any Ad Blockers since Privacy Badger is correctly blocking the (mind you, known-good) domain. Does anybody know why they aren’t using a subdomain?
Designing the API this way is a security footgun. It might not be a vulnerability in itself, but honest mistakes can turn it into one (remember Acropalypse?)
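As a rough illustration of the “only accept known-good site values” option, here is the kind of check I have in mind. It is a sketch under assumptions: assertKnownSite and KNOWN_DATADOG_SITES are names I made up, and apart from datadoghq.com the list of values is my own recollection of Datadog’s sites, so check it against their documentation before relying on it.

```typescript
// Assumed list of known-good site values; only datadoghq.com (US1) is
// confirmed in this post, the rest should be checked against Datadog's docs.
const KNOWN_DATADOG_SITES = [
  'datadoghq.com',
  'us3.datadoghq.com',
  'us5.datadoghq.com',
  'datadoghq.eu',
  'ap1.datadoghq.com',
  'ddog-gov.com',
] as const

type DatadogSite = (typeof KNOWN_DATADOG_SITES)[number]

// Hypothetical guard: returns the site unchanged if it is known-good,
// otherwise throws so that no data is ever sent to an unexpected domain.
function assertKnownSite(site: string): DatadogSite {
  const known: readonly string[] = KNOWN_DATADOG_SITES
  if (known.indexOf(site) === -1) {
    throw new Error(`Unknown Datadog site "${site}"; refusing to send data.`)
  }
  return site as DatadogSite
}

console.log(assertKnownSite('datadoghq.com')) // datadoghq.com

try {
  assertKnownSite('corporateclash.net')
} catch (error) {
  // Our mistake would have failed loudly at init time
  console.log((error as Error).message)
}
```

Until something like this exists in the SDK, nothing stops you from wrapping your own config in a guard like this before passing it to datadogRum.init, so a mistyped site blows up in development rather than leaking in production.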
Datadog’s Response (tl;dr: It’s not our problem)
We reported the footgun to Datadog shortly after we figured out the culprit. After a weekend of waiting, we got the following response, which I will share in full followed by my commentary:
Thanks for your patience while we looked into this. In regards to the item you mentioned for the Datadog RUM SDK, we’re able to provide some additional information around your concerns:
To ensure the proper functioning and security of the Datadog Browser RUM SDK, Datadog provides preconfigured "site" values for customers to use during the initialization process of the SDK—more information about those defaults here.
Except the “site values” are never sanity-checked and look like actual website domains? If you want me to memorise it, at least make it something that doesn’t look like a domain, like us-east-1.
As part of the shared responsibility between Datadog and our customers, it is the customer’s responsibility to leverage these configurations to maintain the integrity and security of their monitoring setup, as well as maintaining the integrity of code used on their systems.
This sounds like you are shielding yourselves from the consequences of bad API design.
Modifying the "site" value provided by Datadog in the SDK initialization code could potentially result in data being sent to unintended domains. Therefore, it is important for customers to use the provided values to maintain a secure and reliable monitoring environment. The correlation of “site” values and intake endpoints can be found here.
But you can sanity-check this, so why don’t you? What do you have to gain from not sanity-checking this?
Being able to modify the “site” value would indicate that the user has access to the application code itself. As a result, they would also be able to override any validation in place.
That’s not what I said. I clearly spelled out in the email that “site operators” (so, like ourselves) can make this mistake. If a user wants to modify our source code to send their own RUM data to browser-intake-istealyourdata.com, more power to them. But that’s not what the report is about.
Conclusion
Yes, we should have read the docs. Yes, we should have enabled CSP. Doing both would have saved us a ton of trouble. But I think Datadog isn’t helping customers with this footgun, and somebody else making the same mistake that we did can land them in very hot water.
Here’s my ask for Datadog to improve on their API design:
Stop calling your regions “sites”.
Make it very hard to mistake Datadog sites for actual websites.
Change the browser-intake- (hyphen) datadoghq.com domain to browser-intake. (dot) datadoghq.com.
Treat API footguns as security vulnerabilities.
I hope this post helps someone avoid the footgun, and given my inability to bulk-check the existence of domains that start with browser-intake-, I can only hope this has not been abused in the wild.
Disclaimers and Advertisements
As of 9th April, 2024 when this article was published, I own 0.125046 shares of Datadog, Inc through Trade Republic.
If you’ve played Toontown Online as a child, Corporate Clash is a non-profit free-to-play fan reimagination of the defunct Disney MMO, Toontown Online. We are on the lookout for all sorts of talent, technical or not, to join our all-volunteer team: https://corporateclash.net/help/apply
I am personally on the job market and am on the hunt for a Senior Full-stack Developer position; you can find my CV at https://cv.thelastcode.io.
[1] We also notified the authorities of the incident, even if the incident, to the best of our knowledge, did not qualify as a “data breach” under US laws, where our legal entity is based.
[2] For the sake of brevity I won’t list through the whole function call chain, but I verified the source code and I cannot see anywhere where the site parameter is sanity-checked before ending up in this function.
[3] While they were doing a cybersecurity course assignment, no less.
[4] Funnily enough, the servers that were accepting our false RUM data were not responding to the client’s requests because they had CORS headers in place. Of course, since the requests did get made, we can’t be sure they didn’t actually process the data, so out of caution we treated the incident as if they had ingested our data and made it look like they didn’t.