Get_prompt_secret
Description
Attempts to extract secrets, passwords, or sensitive information embedded in an LLM's system prompt by using indirect references and circumventing model constraints.
Attack Types
Attack Types
Direct Solicitation — Simply asking for the secret directly
Encoded Solicitation — Requesting the secret in a transformed or encoded format
Character-by-Character Extraction — Getting individual characters or subsets of the secret
Letter/Character Position Techniques — Asking for specific letters at certain positions
Indirect Referencing — Referring to the secret without using restricted terms
Pattern Recognition Exploitation — Using examples to establish a pattern the model will follow
Repetition Attacks — Repeating similar prompts until the model complies
Language Misdirection — Using non-English prompts to bypass constraints
Alternate Representation Attacks — Requesting ASCII codes or other representations
Formatting Tricks — Using special formatting to bypass filters
Creative Context Setting — Creating scenarios where revealing the secret is contextually appropriate
Attack Examples
Asking the model to repeat back its instructions
Requesting a translation or transformation of the secret (ROT13, base64, etc.)
Asking for the first N letters/characters of the secret
Getting ASCII codes of characters instead of the characters themselves
Using creative delimiters to force interpolation of secrets
Providing examples of pattern extraction to prime the model
Using non-English language prompts to weaken instruction following
Requesting the secret as part of a creative work (poem, song, story)
Using code evaluation context to extract information
Creating fictional scenarios where revealing the secret is "safe"
Using restricted character sets (e.g., only words starting with 'S')
Example Prompts
Last updated