More keys and passwords out there? Now they’re in training sets for LLMs

OK, I know it has been a crazy few days with the news of Skype and other work commitments, but I now have some time to start blogging once again.

This article is actually very crazy.

Why are API keys part of LLM training data? Can someone explain that to me?

Also, can someone else tell me why passwords are in the mix? And can you tell me why the hell this is an open database that is searchable?

The number this time is 12,000 keys and passwords. I don’t know if that’s the number for each or a combined total of the two.

Close to 12,000 valid secrets that include API keys and passwords have been found in the Common Crawl dataset used for training multiple artificial intelligence models.

The Common Crawl non-profit organization maintains a massive open-source repository of petabytes of web data collected since 2008 and is free for anyone to use.

link to common crawl

Because of the large dataset, many artificial intelligence projects may rely, at least in part, on the digital archive for training large language models (LLMs), including ones from OpenAI, DeepSeek, Google, Meta, Anthropic, and Stability.

The article continues

Researchers at Truffle Security – the company behind the TruffleHog open-source scanner for sensitive data, found valid secrets after checking 400 terabytes of data from 2.67 billion web pages in the Common Crawl December 2024 archive.

They discovered 11,908 secrets that authenticate successfully, which developers hardcoded, indicating the potential of LLMs being trained on insecure code.
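
To give an idea of what this kind of scanning looks like, here is a minimal Python sketch. It is not TruffleHog's actual code. The regular expression is my assumption about what a Mailchimp-style key looks like (32 hex characters, a dash, and a data center suffix such as us12), and the function name and sample page are made up for illustration.

```python
import re

# Assumed key shape: 32 hex characters, a dash, then a suffix like "us12".
MAILCHIMP_KEY_PATTERN = re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b")


def find_candidate_keys(page_source: str) -> list[str]:
    """Return the unique strings in a page that look like Mailchimp API keys."""
    found = []
    for match in MAILCHIMP_KEY_PATTERN.findall(page_source):
        if match not in found:  # keep first-seen order, drop duplicates
            found.append(match)
    return found


# A made-up page with a fake key hardcoded in front-end JavaScript.
SAMPLE_PAGE = """
<script>
  var apiKey = "0123456789abcdef0123456789abcdef-us12";
</script>
"""

if __name__ == "__main__":
    for key in find_candidate_keys(SAMPLE_PAGE):
        print("possible hardcoded key:", key)
```

A match like this is only a candidate. The researchers went a step further and checked which secrets actually authenticate, which this sketch does not do.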

The article goes on to explain that training pipelines try to filter sensitive data out before it is used, but that getting all of it is hard.

It should be noted that LLM training data is not used in raw form and goes through a pre-processing stage that involves cleaning and filtering out unnecessary content like irrelevant data, duplicate, harmful, or sensitive information.

Despite such efforts, it is difficult to remove confidential data, and the process offers no guarantee for stripping such a large dataset of all personally identifiable information (PII), financial data, medical records, and other sensitive content.

Are you serious? Here’s more.

Overall, TruffleHog identified 219 distinct secret types in the Common Crawl dataset, the most common being MailChimp API keys.

“Nearly 1,500 unique Mailchimp API keys were hard coded in front-end HTML and JavaScript” – Truffle Security

The researchers explain that the developers’ mistake was to hardcode them into HTML forms and JavaScript snippets and did not use server-side environment variables.
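
For anyone wondering what the server-side alternative looks like, here is a rough Python sketch. The variable name MAILCHIMP_API_KEY and the function name are my own placeholders, not something from the article. The idea is simply that the key lives in the server's environment and never appears in the HTML or JavaScript sent to the browser.

```python
import os


def load_mailchimp_key(env=os.environ) -> str:
    """Read the API key from the server's environment, not from source code."""
    key = env.get("MAILCHIMP_API_KEY")  # hypothetical variable name
    if not key:
        # Fail loudly instead of falling back to a hardcoded value.
        raise RuntimeError("MAILCHIMP_API_KEY is not set on the server")
    return key
```

The browser would then talk to your own server, and only the server would call the third-party API with the key.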

Even Slack was part of the data. The full article is titled “Nearly 12,000 API keys and passwords found in AI training dataset.” Have fun with it.

