
Common Crawl (CCBot)

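Non-profit project (founded 2007) that runs a large-scale, openly published web crawl. Its archives feed many LLM training corpora: the published data mixes for GPT-3 and Llama include Common Crawl-derived datasets, and other models such as GPT-4 and Claude are widely reported to draw on similar data, although their exact training mixes are not fully disclosed. Common Crawl's crawler identifies itself as CCBot and honors robots.txt. Sites that allow CCBot in their robots.txt stay eligible for future crawls and so tend to be better represented in downstream LLM training data; sites that disallow it are skipped.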
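As a rough illustration of how a robots.txt policy applies to CCBot, the sketch below uses Python's standard `urllib.robotparser` to evaluate two hypothetical policies, one that welcomes CCBot and one that blocks it. The policy text, the `example.com` URLs, and the helper name `agent_allowed` are assumptions made for this example. The parser is Python's own and may differ in edge cases from the one Common Crawl runs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: CCBot may fetch everything; all other bots are
# kept out of /private/.
WELCOME_CCBOT = """\
User-agent: CCBot
Allow: /

User-agent: *
Disallow: /private/
"""

# Hypothetical policy: CCBot is blocked from the whole site.
BLOCK_CCBOT = """\
User-agent: CCBot
Disallow: /
"""


def agent_allowed(robots_txt: str, url: str, agent: str = "CCBot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(agent_allowed(WELCOME_CCBOT, page))
    print(agent_allowed(BLOCK_CCBOT, page))
```

A group that names CCBot takes precedence over the `*` group, so under the first policy CCBot may also fetch `/private/` pages that other crawlers may not. An explicit `Allow` line is not strictly required, because a crawler that is not disallowed is permitted by default.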

