3. Data Prep
Tools for Preprocessing (Encoding/Scaling)
Click on Data Prep in the Machine Learning category.
Allocate to: Assign a variable name to the object that performs the selected preprocessing task.
Code View: Preview the code that will be output.
Run: Execute the code.
Encoding
Sparse (OneHotEncoder): If true, returns the encoding result as a sparse matrix.
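Handle unknown (OneHotEncoder, OrdinalEncoder): Determines what happens when the data being transformed (for example, the test data) contains a category that was not present in the training data. If ignore is selected, the one-hot columns for that feature are all set to 0; if error is selected, a ValueError is raised.
Unknown values (OrdinalEncoder): The specific value used to encode unknown categories, as an alternative to ignore or error. In scikit-learn this corresponds to handle_unknown='use_encoded_value' combined with unknown_value.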
Cols (TargetEncoder): Select the columns to encode.
Handle missing (TargetEncoder): Choose how to handle missing values.
Smoothing (TargetEncoder): Controls how strongly each category's target mean is pulled toward the overall target mean. Categories with few samples are pulled the most, which helps prevent overfitting; larger values mean stronger smoothing.
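The handling of unknown categories can be sketched directly in scikit-learn. This is a minimal illustration, not the code the tool generates; the toy arrays and the choice of -1 as the unknown value are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data (assumed for illustration): "green" appears only at transform time.
X_train = np.array([["red"], ["blue"], ["red"]], dtype=object)
X_test = np.array([["blue"], ["green"]], dtype=object)

# Handle unknown = ignore: an unseen category becomes an all-zero row.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(X_train)
onehot_result = onehot.transform(X_test).toarray()  # sparse by default

# OrdinalEncoder: unseen categories are encoded as unknown_value.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal.fit(X_train)
ordinal_result = ordinal.transform(X_test)

print(onehot_result)   # columns are ordered as ["blue", "red"]
print(ordinal_result)
```

With handle_unknown="error" instead, the transform call on X_test would raise a ValueError.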
Scaling
With mean (StandardScaler): Center the mean of the data to zero.
With std (StandardScaler): Scale the standard deviation of the data to 1.
With centering (RobustScaler): Centers each attribute (column) by subtracting its median.
With scaling (RobustScaler): Scales each attribute by dividing it by its IQR.
Feature range (MinMaxScaler): Sets the minimum and maximum values for the scaled result.
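As a rough sketch of how these scaler options behave, the snippet below applies the three scalers to a single column that contains one outlier. The data is made up for the example, and the parameter values shown are scikit-learn's defaults written out explicitly.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an outlier (assumed toy data).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# With mean / With std: result has mean 0 and standard deviation 1.
standard = StandardScaler(with_mean=True, with_std=True).fit_transform(X)

# With centering / With scaling: subtract the median, divide by the IQR.
robust = RobustScaler(with_centering=True, with_scaling=True).fit_transform(X)

# Feature range: the minimum maps to 0 and the maximum maps to 1.
minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(standard.ravel())
print(robust.ravel())
print(minmax.ravel())
```

Because RobustScaler relies on the median and IQR, the outlier has far less influence on how the other values are scaled than it does with StandardScaler or MinMaxScaler.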
Norm (Normalizer):
L1: Scales each sample (row) so that the sum of the absolute values of its attributes is 1.
L2: Scales each sample (row) so that its Euclidean norm (vector length) is 1.
Max Norm: Scales each sample (row) so that its largest absolute value is 1.
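The three norms can be compared on a small example. This is a sketch with made-up values; note that Normalizer works row by row, not column by column.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two samples (rows) with two attributes each (assumed toy data).
X = np.array([[3.0, 4.0], [1.0, 1.0]])

l1 = Normalizer(norm="l1").fit_transform(X)   # each row divided by sum of |values|
l2 = Normalizer(norm="l2").fit_transform(X)   # each row divided by its Euclidean norm
mx = Normalizer(norm="max").fit_transform(X)  # each row divided by its largest |value|

print(l1)
print(l2)
print(mx)
```

For the first row, the divisors are 7 (L1), 5 (L2), and 4 (max).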
N bins (KBinsDiscretizer): Determines how many bins to divide the variable into.
Strategy (KBinsDiscretizer):
uniform: Divides the value range into bins of equal width.
quantile: Divides the data so that each bin contains the same number of data points.
Encode (KBinsDiscretizer): Specify the encoding method.
ordinal: Encodes each interval as an integer.
onehot: Encodes each interval as a binary vector.
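The difference between the binning strategies and encodings is easiest to see on skewed data. The following is a small sketch with assumed toy values, using two bins.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed single feature (assumed toy data).
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [20.0]])

# uniform: equal-width bins, here [0, 10) and [10, 20].
uniform = KBinsDiscretizer(n_bins=2, strategy="uniform", encode="ordinal")
uniform_bins = uniform.fit_transform(X).ravel()

# quantile: bins chosen so that each holds the same number of points.
quantile = KBinsDiscretizer(n_bins=2, strategy="quantile", encode="ordinal")
quantile_bins = quantile.fit_transform(X).ravel()

# onehot: one binary column per bin, returned as a sparse matrix.
onehot = KBinsDiscretizer(n_bins=2, strategy="uniform", encode="onehot")
onehot_bins = onehot.fit_transform(X).toarray()

print(uniform_bins)
print(quantile_bins)
print(onehot_bins)
```

With uniform, four of the six values fall into the first bin; with quantile, the split is three and three.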
ETC (SimpleImputer / SMOTE / MakeColumnTransformer)
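Missing values (SimpleImputer): Specifies the value to treat as missing (for example, NaN).
Fill value (SimpleImputer): The value used to replace missing entries. In scikit-learn this applies when the strategy is constant.
Copy (SimpleImputer): If true, imputation is performed on a copy of the data and the original data is left unchanged.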
Add indicator (SimpleImputer): Adds a new column with 0s and 1s, with a 1 for rows with missing values and a 0 for rows without.
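The SimpleImputer options can be combined as in the sketch below. The array and the fill value of 0.0 are assumptions chosen for the example; strategy="constant" is set so that the fill value takes effect.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two features, each with one missing entry (assumed toy data).
X = np.array([[1.0, np.nan], [2.0, 5.0], [np.nan, 7.0]])

imputer = SimpleImputer(
    missing_values=np.nan,  # Missing values: what counts as missing
    strategy="constant",    # needed for fill_value to be used
    fill_value=0.0,         # Fill value: replacement for missing entries
    copy=True,              # Copy: leave X itself untouched
    add_indicator=True,     # Add indicator: append 0/1 missing-flag columns
)
X_imputed = imputer.fit_transform(X)

print(X_imputed)  # first two columns are imputed data, last two are indicators
print(X)          # the original still contains NaN
```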
K neighbors (SMOTE): Specifies the number of nearest neighbors used to generate synthetic samples around each minority-class data point.
Sampling strategy (SMOTE):
auto: Resamples every class except the majority class so that the classes are balanced.
minority: Resamples only the minority class, making it the same size as the majority class.
float: You can specify the desired class ratio. For example, setting it to 0.5 makes the minority class dataset half the size of the majority class dataset.
Estimator (MakeColumnTransformer): Specify the transformer (for example, an encoder or a scaler) to apply to a group of columns. The estimator selected here is applied to the columns selected in Columns below, so different estimators can be assigned to different columns.
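A minimal sketch of pairing estimators with columns is shown below. The DataFrame and its column names (color, size) are hypothetical and exist only for this example.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1.0, 2.0, 3.0]})

# Each tuple pairs an estimator with the columns it should transform.
transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["color"]),
    (StandardScaler(), ["size"]),
)
result = transformer.fit_transform(df)

print(result)  # two one-hot columns for color, then the scaled size column
```

The outputs of the individual estimators are concatenated column-wise in the order in which the tuples are listed.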