Glossary · technical

Common Crawl (CCBot)

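Non-profit project (founded 2007) that operates a large-scale, openly published web crawl. Datasets derived from it feed many LLM training corpora: the documented training mixes for GPT-3 and Llama include filtered Common Crawl data, and GPT-4 and Claude are widely believed to rely on similar web-crawl data, though their training sets are not fully disclosed. Common Crawl's crawler identifies itself as CCBot and obeys robots.txt. Sites that leave CCBot unblocked in robots.txt stay eligible for the crawl, and therefore for downstream LLM training datasets; sites that disallow it are skipped.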
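To check how a given robots.txt treats CCBot, the rules can be evaluated locally. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the robots.txt content, the `is_allowed` helper, and the example.com URLs are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: CCBot may fetch everything,
# while all other crawlers are kept out of /private/.
ROBOTS_TXT = """\
User-agent: CCBot
Allow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/private/page"
    print("CCBot:", is_allowed(ROBOTS_TXT, "CCBot", page))
    print("OtherBot:", is_allowed(ROBOTS_TXT, "OtherBot", page))
```

Replacing `Allow: /` with `Disallow: /` in the CCBot group would make the same check report that CCBot is blocked, which is how a site opts out of the crawl.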

Full glossary index (70)

All terms in the Step Secrets Editorial Glossary. Each is a standalone reference page.

