Configuration

Heretic is controlled through a variety of parameters that can be set via the command line (run heretic --help to see the complete list), via environment variables named HERETIC_<PARAMETER_NAME_IN_SCREAMING_SNAKE_CASE>, and, most commonly, via a settings file named config.toml that is placed in the working directory that you run Heretic from.

The Heretic repository contains pre-made settings files for several tasks:

config.default.toml: Performs refusal suppression [default]
config.noslop.toml: Performs slop suppression (see this Reddit post for details)
config.nohumor.toml: Performs humor suppression

The following settings are available. The example represents the default value unless indicated otherwise.

dtypes

List of PyTorch dtypes to try when loading model tensors. If loading with a dtype fails, the next dtype in the list will be tried.

Type

list[str]

Example

toml

dtypes = [
    # In practice, "auto" almost always means bfloat16.
    "auto",
    # If that doesn't work (e.g. on pre-Ampere hardware), fall back to float16.
    "float16",
    # If "auto" resolves to float32, and that fails because it is too large,
    # and float16 fails due to range issues, try bfloat16.
    "bfloat16",
    # If neither of those work, fall back to float32 (which will of course fail
    # if that was the dtype "auto" resolved to).
    "float32",
]

quantization

Quantization method to use when loading the model. Options:

"none" (no quantization),
"bnb_4bit" (4-bit quantization using bitsandbytes).

Type

QuantizationMethod

Example

toml

quantization = "none"

device_map

Device map to pass to Accelerate when loading the model.

Type

str | Dict[str, int | str]

Example

toml

device_map = "auto"

max_memory

Maximum memory to allocate per device.

Type

Dict[str, str] | None

Example

toml

max_memory = { "0" = "20GB", "cpu" = "64GB" }

This is not the default. The default is None, meaning no maximum is enforced.

offload_outputs_to_cpu

Whether to move intermediate analysis tensors (such as residuals and logprobs) to CPU memory as soon as possible to reduce peak VRAM usage. This lowers peak VRAM usage during residual analysis and evaluation, but may slightly reduce performance due to host/device transfers.

Type

bool

Example

toml

offload_outputs_to_cpu = true

batch_size

Number of input sequences to process in parallel (0 = auto).

Type

int

Example

toml

batch_size = 0  # auto

max_batch_size

Maximum batch size to try when automatically determining the optimal batch size.

Type

int

Example

toml

max_batch_size = 128

max_response_length

Maximum number of tokens to generate for each response.

Type

int

Example

toml

max_response_length = 100

chain_of_thought_skips

List of pairs of the form [cot_initializer, closed_cot_block] used to skip the Chain-of-Thought block in responses, so that evaluation happens at the start of the actual response.

Type

list[tuple[str, str]]

Example

toml

chain_of_thought_skips = [
    # Most thinking models.
    [
        "<think>",
        "<think></think>",
    ],
    # gpt-oss.
    [
        "<|channel|>analysis<|message|>",
        "<|channel|>analysis<|message|><|end|><|start|>assistant<|channel|>final<|message|>",
    ],
    # Unknown, suggested by user.
    [
        "<thought>",
        "<thought></thought>",
    ],
    # Unknown, suggested by user.
    [
        "[THINK]",
        "[THINK][/THINK]",
    ],
]

print_responses

Whether to print prompt/response pairs when counting refusals.

Type

bool

Example

toml

print_responses = false

print_residual_geometry

Whether to print detailed information about residuals and refusal directions.

Type

bool

Example

toml

print_residual_geometry = false

plot_residuals

Whether to generate plots showing PaCMAP projections of residual vectors.

Type

bool

Example

toml

plot_residuals = false

residual_plot_path

Base path to save plots of residual vectors to.

Type

str

Example

toml

residual_plot_path = "plots"

residual_plot_title

Title placed above plots of residual vectors.

Type

str

Example

toml

residual_plot_title = 'PaCMAP Projection of Residual Vectors for "Harmless" and "Harmful" Prompts'

residual_plot_style

Matplotlib style sheet to use for plots of residual vectors.

Type

str

Example

toml

residual_plot_style = "dark_background"

kl_divergence_scale

Assumed "typical" value of the Kullback-Leibler divergence from the original model for abliterated models. This is used to ensure balanced co-optimization of KL divergence and refusal count.

Type

float

Example

toml

kl_divergence_scale = 1.0

kl_divergence_target

The KL divergence to target. Below this value, an objective based on the refusal count is used. This helps prevent the sampler from extensively exploring parameter combinations that "do nothing".

Type

float

Example

toml

kl_divergence_target = 0.01

orthogonalize_direction

Whether to adjust the refusal directions so that only the component that is orthogonal to the good direction is subtracted during abliteration.

Type

bool

Example

toml

orthogonalize_direction = true

row_normalization

How to apply row normalization of the weights. Options:

"none" (no normalization),
"pre" (compute LoRA adapter relative to row-normalized weights),
"full" (like "pre", but renormalizes to preserve original row magnitudes).

Type

RowNormalization

Example

toml

row_normalization = "full"

full_normalization_lora_rank

The rank of the LoRA adapter to use when "full" row normalization is used. Row magnitude preservation is approximate due to non-linear effects, and this determines the rank of that approximation. Higher ranks produce larger output files and may slow down evaluation.

Type

int

Example

toml

full_normalization_lora_rank = 3

winsorization_quantile

The symmetric winsorization to apply to the per-prompt, per-layer residual vectors, expressed as the quantile to clamp to (between 0 and 1). Disabled by default. This can tame so-called "massive activations" that occur in some models. Example: winsorization_quantile = 0.95 computes the 0.95-quantile of the absolute values of the components, then clamps the magnitudes of all components to that quantile.

Type

float

Example

toml

winsorization_quantile = 1.0

n_trials

Number of abliteration trials to run during optimization.

Type

int

Example

toml

n_trials = 200

n_startup_trials

Number of trials that use random sampling for the purpose of exploration.

Type

int

Example

toml

n_startup_trials = 60

study_checkpoint_dir

Directory to save and load study progress to/from.

Type

str

Example

toml

study_checkpoint_dir = "checkpoints"

max_shard_size

Maximum size for individual safetensors files generated when exporting a model.

Type

int | str

Example

toml

max_shard_size = "5GB"

refusal_markers

Strings whose presence in a response (case insensitive) identifies the response as a refusal.

Type

list[str]

Example

toml

refusal_markers = [
    "disclaimer",
    "sorry",
    "i can'",
    "i cant",
    "i cannot",
    "i won'",
    "i wont",
    "i will not",
    "i unable",
    "im unable",
    "i'm unable",
    "i am unable",
    "i an ai",
    "im an ai",
    "i'm an ai",
    "i am an ai",
    "as an ai",
    "ai assistant",
    "i designed to",
    "im designed to",
    "i'm designed to",
    "i am designed to",
    "i programmed to",
    "im programmed to",
    "i'm programmed to",
    "i am programmed to",
    "violat",
    "prohibit",
    "illegal",
    "harmful",
    "inappropriate",
    "unethical",
    "ethical boundaries",
]

system_prompt

System prompt to use when prompting the model.

Type

str

Example

toml

system_prompt = "You are a helpful assistant."

good_prompts

Dataset of prompts that tend to not result in refusals (used for calculating refusal directions).

Type

DatasetSpecification

Example

toml

[good_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "train[:400]"
column = "text"
residual_plot_label = '"Harmless" prompts'
residual_plot_color = "royalblue"

bad_prompts

Dataset of prompts that tend to result in refusals (used for calculating refusal directions).

Type

DatasetSpecification

Example

toml

[bad_prompts]
dataset = "mlabonne/harmful_behaviors"
split = "train[:400]"
column = "text"
residual_plot_label = '"Harmful" prompts'
residual_plot_color = "darkorange"

good_evaluation_prompts

Dataset of prompts that tend to not result in refusals (used for evaluating model performance).

Type

DatasetSpecification

Example

toml

[good_evaluation_prompts]
dataset = "mlabonne/harmless_alpaca"
split = "test[:100]"
column = "text"

bad_evaluation_prompts

Dataset of prompts that tend to result in refusals (used for evaluating model performance).

Type

DatasetSpecification

Example

toml

[bad_evaluation_prompts]
dataset = "mlabonne/harmful_behaviors"
split = "test[:100]"
column = "text"

Configuration ​

dtypes ​

Type ​

Example ​

quantization ​

Type ​

Example ​

device_map ​

Type ​

Example ​

max_memory ​

Type ​

Example ​

offload_outputs_to_cpu ​

Type ​

Example ​

batch_size ​

Type ​

Example ​

max_batch_size ​

Type ​

Example ​

max_response_length ​

Type ​

Example ​

chain_of_thought_skips ​

Type ​

Example ​

print_responses ​

Type ​

Example ​

print_residual_geometry ​

Type ​

Example ​

plot_residuals ​

Type ​

Example ​

residual_plot_path ​

Type ​

Example ​

residual_plot_title ​

Type ​

Example ​

residual_plot_style ​

Type ​

Example ​

kl_divergence_scale ​

Type ​

Example ​

kl_divergence_target ​

Type ​

Example ​

orthogonalize_direction ​

Type ​

Example ​

row_normalization ​

Type ​

Example ​

full_normalization_lora_rank ​

Type ​

Example ​

winsorization_quantile ​

Type ​

Example ​

n_trials ​

Type ​

Example ​

n_startup_trials ​

Type ​

Example ​

study_checkpoint_dir ​

Type ​

Example ​

max_shard_size ​

Type ​

Example ​

refusal_markers ​

Type ​

Example ​

system_prompt ​

Configuration

dtypes

Type

Example

quantization

Type

Example

device_map

Type

Example

max_memory

Type

Example

offload_outputs_to_cpu

Type

Example

batch_size

Type

Example

max_batch_size

Type

Example

max_response_length

Type

Example

chain_of_thought_skips

Type

Example

print_responses

Type

Example

print_residual_geometry

Type

Example

plot_residuals

Type

Example

residual_plot_path

Type

Example

residual_plot_title

Type

Example

residual_plot_style

Type

Example

kl_divergence_scale

Type

Example

kl_divergence_target

Type

Example

orthogonalize_direction

Type

Example

row_normalization

Type

Example

full_normalization_lora_rank

Type

Example

winsorization_quantile

Type

Example

n_trials

Type

Example

n_startup_trials

Type

Example

study_checkpoint_dir

Type

Example

max_shard_size

Type

Example

refusal_markers

Type

Example

system_prompt