
Data Manipulation for Machine Learning

From an attacker's perspective, we can manipulate dataset values into unexpected ones. Inserting inappropriate (or nonsensical) values can degrade the performance of ML models trained on that data. To achieve this, however, we need permission to access the training dataset.

Prepare Dataset

Before manipulating anything, load the dataset as a pandas DataFrame.

import pandas as pd

df = pd.read_csv('example.csv', index_col=0)

Data Analysis

Before attacking, we need to investigate the dataset and find the points where manipulation can fool both models and people.

# Information
df.info()

# Print descriptive statistics
df.describe()

# Dimensionality
df.shape

# Data types
df.dtypes

# Correlation of columns (numeric_only requires pandas 1.5+)
df.corr(numeric_only=True)

# Histogram
df.hist()

Access Values
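
The snippet below is a minimal sketch of common ways to read values from a DataFrame before changing them. The small in-memory DataFrame and its `name` and `score` columns are hypothetical stand-ins for whatever example.csv actually contains.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from example.csv
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol"], "score": [82.0, 67.5, 91.0]},
    index=[1, 2, 3],
)

# A whole column
scores = df["score"]

# A row by index label, and a row by position
row_by_label = df.loc[2]
row_by_position = df.iloc[0]

# A single cell by index label and column name
bob_score = df.at[2, "score"]

# Rows matching a condition
high_scores = df[df["score"] > 80]
```

`loc` and `at` use index labels, while `iloc` uses integer positions, so the two can return different rows when the index does not start at 0.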

Attacks

After analyzing the data, we're ready to attack it.

Value Overriding

Override the values to abnormal or unexpected values.
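
As a rough illustration, overriding can target a single cell or every row that matches a condition. The DataFrame, the column names, and the replacement values here are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from example.csv
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol"], "score": [82.0, 67.5, 91.0]},
    index=[1, 2, 3],
)

# Work on a copy so the original stays available for comparison
poisoned = df.copy()

# Override a single cell with an out-of-range value
poisoned.at[1, "score"] = -9999.0

# Override every row that matches a condition
poisoned.loc[poisoned["score"] > 90, "score"] = 0.0
```

An extreme value such as -9999 is easy to spot with `df.describe()`, whereas smaller shifts that stay within the column's normal range are harder to notice.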

Filling Missing (NaN) Values with Inappropriate Methods

Typically, NaN values are filled with the mean of the column. From an attacker's perspective, however, other statistics can be used instead, e.g. max() or min(), which push the filled values toward an extreme.
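
A minimal sketch comparing the conventional mean fill with the max() and min() variants, assuming a hypothetical numeric `score` column that contains missing values:

```python
import pandas as pd

# Hypothetical column with two missing values
nan = float("nan")
df = pd.DataFrame({"score": [82.0, nan, 91.0, nan, 10.0]})

# Conventional approach: fill with the column mean
baseline = df["score"].fillna(df["score"].mean())

# Attack variants: fill with the column maximum or minimum instead
poisoned_max = df["score"].fillna(df["score"].max())
poisoned_min = df["score"].fillna(df["score"].min())
```

The more missing values a column has, the further this shifts its distribution away from the mean-filled baseline.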

Integrating Another Dataset

By integrating values from another dataset, we may fool ML models with fake values. For example, suppose fake_scores.csv contains a fake score for each person. We can replace all original scores with the fake ones by creating a new DataFrame that merges in this fake dataset.
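
One possible way to do this is sketched below. To keep the example runnable, an in-memory DataFrame stands in for the contents of fake_scores.csv, and the `name` and `score` column names are assumptions rather than the real schema.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from example.csv
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol"], "score": [82.0, 67.5, 91.0]}
)

# Hypothetical contents of fake_scores.csv; in practice this would be
# loaded with pd.read_csv('fake_scores.csv')
fake_df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol"], "score": [5.0, 99.0, 12.0]}
)

# Drop the original scores, then join the fake ones on the shared key.
# how="left" keeps every original row, in its original order.
poisoned = df.drop(columns=["score"]).merge(fake_df, on="name", how="left")
```

If a person is missing from the fake dataset, the left join leaves that score as NaN, which then has to be filled or the row dropped.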

Removing Required Columns

Remove columns that are required to train the model. This is blatant and may not be useful, but it is noted here for completeness.
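
A short sketch of column removal, assuming a hypothetical `score` column that the model needs as a feature or label:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from example.csv
df = pd.DataFrame(
    {"name": ["Alice", "Bob"], "age": [30, 41], "score": [82.0, 67.5]}
)

# Drop the column the model depends on; drop() returns a new DataFrame
poisoned = df.drop(columns=["score"])
```

A training pipeline that selects `score` by name would typically fail with a KeyError on this data, which is why the attack is easy to detect.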
