Datasets reference
The Dataset class is a core component of Giskard Open Source that represents a collection of test cases for evaluating LLMs.
- class giskard.Dataset(df: DataFrame, name: str | None = None, target: Hashable | None | NotGiven = NOT_GIVEN, cat_columns: List[str] | None = None, column_types: Dict[Hashable, str] | None = None, id: UUID | None = None, validation=True, original_id: UUID | None = None)[source]
To scan, test and debug your model, you need to provide a dataset that your model can run on. This dataset can be your training, testing, golden, or production dataset.
The pandas.DataFrame you provide should contain the raw data before pre-processing (categorical encoding, scaling, etc.). The prediction function that you wrap with the Giskard Model should be able to run directly on this DataFrame.
- df
A pandas.DataFrame that contains the raw data (before all the pre-processing steps) and the actual ground truth variable (target). df can contain more columns than the features of the model, such as the sample_id, metadata, etc.
- Type:
pandas.DataFrame
- target
The column name in df corresponding to the actual target variable (ground truth).
- Type:
Optional[str]
- cat_columns
A list of strings representing the names of categorical columns (default None). If not provided, the categorical columns will be automatically inferred.
- Type:
Optional[List[str]]
- column_types
A dictionary of column names and their types (numeric, category or text) for all columns of df. If not provided, the column types will be automatically inferred.
- Type:
Optional[Dict[str, str]]
- __init__(df: DataFrame, name: str | None = None, target: Hashable | None | NotGiven = NOT_GIVEN, cat_columns: List[str] | None = None, column_types: Dict[Hashable, str] | None = None, id: UUID | None = None, validation=True, original_id: UUID | None = None) None [source]
Initializes a Dataset object.
- Parameters:
df (pd.DataFrame) – The input dataset as a pandas DataFrame.
name (Optional[str]) – The name of the dataset.
target (Optional[str]) – The column name in df corresponding to the actual target variable (ground truth). The target needs to be explicitly set to None if the dataset doesn’t have any target variable.
cat_columns (Optional[List[str]]) – A list of column names that are categorical.
column_types (Optional[Dict[str, str]]) – A dictionary mapping column names to their types.
id (Optional[uuid.UUID]) – A UUID that uniquely identifies this dataset.
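A minimal sketch of how the parameters above fit together, assuming the giskard package is installed. The column names, the dataset name, and the helper wrap_as_giskard_dataset are hypothetical. Only the giskard.Dataset keyword arguments come from the signature documented here.

```python
import pandas as pd

# Raw, un-encoded data: no scaling or categorical encoding has been applied.
raw_df = pd.DataFrame(
    {
        "age": [23, 35, 41, 58],
        "plan": ["basic", "pro", "basic", "pro"],
        "comment": ["slow app", "great support", "too pricey", "works fine"],
        "churned": [0, 1, 0, 1],
    }
)


def wrap_as_giskard_dataset(df):
    """Hypothetical helper: wrap df with the constructor documented above."""
    import giskard  # imported lazily so the snippet loads without the package

    return giskard.Dataset(
        df=df,
        name="churn_sample",
        target="churned",  # pass target=None explicitly if there is no ground truth
        cat_columns=["plan"],  # types of the remaining columns are left to inference
    )
```

Calling wrap_as_giskard_dataset(raw_df) would then return the wrapped dataset. Extra columns such as a sample id can stay in the DataFrame.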
Notes
If neither cat_columns nor column_types is provided, the column types are inferred heuristically. See the _infer_column_types method.
- _infer_column_types(column_types: Dict[str, str] | None, cat_columns: List[str] | None, validation: bool = True)[source]
This function infers the column types of a given DataFrame based on the number of unique values and column data types. It takes into account the provided column types and categorical columns. The inferred types can be ‘text’, ‘numeric’, or ‘category’. The function also applies a logarithmic rule to determine the category threshold.
Here’s a summary of the function’s logic:
1. If no column types are provided, initialize an empty dictionary.
2. Determine the columns in the DataFrame, excluding the target column if it exists.
3. If categorical columns are specified, prioritize them over the provided column types and mark them as ‘category’.
4. Check for any unknown columns in the provided column types and remove them from the dictionary.
5. If there are no missing columns, remove the target column (if present) from the column types dictionary.
6. Calculate the number of unique values in each missing column.
7. For each missing column:
   - If the number of unique values is less than or equal to the category threshold, categorize it as ‘category’.
   - Otherwise, attempt to convert the column to numeric using pd.to_numeric and categorize it as ‘numeric’.
   - If the column does not have the expected numeric data type and validation is enabled, issue a warning message.
   - If conversion to numeric raises a ValueError, categorize the column as ‘text’.
8. Return the column types dictionary.
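The steps above can be sketched in plain pandas as follows. This is an illustrative re-implementation based only on the description, not Giskard's source code: the standalone function infer_column_types, its signature, and the sample data are assumptions, the category threshold is passed in as a parameter, and the validation warning is omitted.

```python
import pandas as pd


def infer_column_types(df, target=None, column_types=None, cat_columns=None,
                       category_threshold=2):
    """Illustrative sketch of the described inference logic (not Giskard's code)."""
    # Start from the provided mapping, or from an empty dictionary.
    column_types = dict(column_types or {})

    # Feature columns are all DataFrame columns except the target.
    df_columns = [col for col in df.columns if col != target]

    # Explicit categorical columns take priority over column_types.
    for col in cat_columns or []:
        if col != target:
            column_types[col] = "category"

    # Drop entries for columns that do not exist in the DataFrame.
    column_types = {c: t for c, t in column_types.items() if c in df.columns}

    # "Missing" columns are feature columns that still have no type.
    missing = [col for col in df_columns if col not in column_types]
    if not missing:
        column_types.pop(target, None)  # the target needs no type entry
        return column_types

    n_unique = df[missing].nunique()
    for col in missing:
        if n_unique[col] <= category_threshold:
            column_types[col] = "category"
            continue
        try:
            pd.to_numeric(df[col])  # raises ValueError for non-numeric strings
            column_types[col] = "numeric"
        except ValueError:
            column_types[col] = "text"
    return column_types


sample = pd.DataFrame(
    {
        "age": [23, 35, 41, 58, 62],
        "plan": ["basic", "pro", "basic", "pro", "basic"],
        "comment": ["slow app", "great", "too pricey", "ok", "love it"],
        "churned": [0, 1, 0, 1, 0],
    }
)
inferred = infer_column_types(sample, target="churned")
# inferred == {"age": "numeric", "plan": "category", "comment": "text"}
```

Here plan has two unique values, which does not exceed the threshold of 2, so it becomes a category. The columns age and comment each have five unique values, so their type depends on whether pd.to_numeric succeeds.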
The logarithmic rule is used to calculate the category threshold. The formula is: category_threshold = round(np.log10(len(self.df))) if len(self.df) >= 100 else 2. This means that if the length of the DataFrame is greater than or equal to 100, the category threshold is set to the rounded value of the base-10 logarithm of the DataFrame length. Otherwise, the category threshold is set to 2. The logarithmic rule helps in dynamically adjusting the category threshold based on the size of the DataFrame.
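As a worked example, the formula can be evaluated for a few DataFrame sizes. The sketch below uses math.log10 from the standard library in place of np.log10 (the two agree for scalar inputs). The function name category_threshold and the chosen row counts are illustrative.

```python
import math


def category_threshold(n_rows):
    """Threshold from the formula above, with math.log10 standing in for np.log10."""
    return round(math.log10(n_rows)) if n_rows >= 100 else 2


for n_rows in (50, 100, 1_000, 30_000, 50_000, 1_000_000):
    print(n_rows, category_threshold(n_rows))
# 50 -> 2, 100 -> 2, 1000 -> 3, 30000 -> 4, 50000 -> 5, 1000000 -> 6
```

The threshold grows slowly: a column in a million-row DataFrame is inferred as categorical only if it has at most 6 unique values.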
- Returns:
A dictionary that maps column names to their inferred types, one of ‘text’, ‘numeric’, or ‘category’.
- Return type:
dict
- process()[source]
Process the dataset by applying all the transformation and slicing functions in the defined order.
- Returns:
The processed dataset after applying all the transformation and slicing functions.
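To illustrate what applying functions "in the defined order" means, here is a conceptual sketch in plain pandas. It does not use Giskard's API: the OrderedPipeline class, its add method, and the two example functions are hypothetical stand-ins for a dataset with registered slicing and transformation functions.

```python
import pandas as pd


class OrderedPipeline:
    """Hypothetical stand-in: stores functions and applies them in insertion order."""

    def __init__(self, df):
        self.df = df
        self.steps = []

    def add(self, func):
        self.steps.append(func)
        return self  # allow chaining

    def process(self):
        result = self.df
        for func in self.steps:  # order of registration is order of application
            result = func(result)
        return result


def keep_long_rows(frame):
    """Slicing-style function: keep rows whose recorded length is at least 5."""
    return frame[frame["length"] >= 5]


def strip_text(frame):
    """Transformation-style function: remove surrounding whitespace from text."""
    return frame.assign(text=frame["text"].str.strip())


df = pd.DataFrame({"text": ["  Hello ", "world", "  Hi"], "length": [8, 5, 4]})
processed = OrderedPipeline(df).add(keep_long_rows).add(strip_text).process()
# processed["text"].tolist() == ["Hello", "world"]; df itself is unchanged
```

Order matters whenever a slice depends on a column that a transformation modifies, which is why the functions are applied in the sequence they were defined.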