
Background

During my first year at university, I took an Object-Oriented Programming class. It was challenging but also rewarding: it pushed me in various ways and helped me develop a better understanding of, and greater competence in, both Python programming and problem-solving.

For the final project in this class, we were tasked with creating an imputer. The assignment described imputation as follows: “Most data science projects start by pre-processing a dataset to ensure the data is ready to use for its intended purpose. One of the tasks that a data scientist would typically complete during such a pre-processing phase is to replace missing data values in the dataset using a process known as imputation.” In other words, we had to create a tool that fills in missing values with the mean, mode, or median.

Methodology

To achieve this, we were asked to use the Strategy pattern. The Strategy pattern puts each interchangeable behavior in its own class behind a common interface, so one behavior can be swapped for another without changing the code that uses it. In this project, the behavior we wanted to vary is the method of imputation (mean/mode/median). I therefore started with the following interfaces:

from abc import ABC, abstractmethod

class ImputerStrategy(ABC):  # Interface for the Imputer class
    @abstractmethod
    def fit(self):
        pass

    @abstractmethod
    def transform(self):
        pass

class CalculateStrategy(ABC):  # Interface for the Mean/Mode/Median class
    @abstractmethod
    def calculate(self, data):
        pass

class AxisStrategy(ABC):  # Interface for axis-specific strategies
    @abstractmethod
    def select(self, data):
        pass

Brief explanation of the code:

ImputerStrategy - Acts as the interface for the imputer class. This is where we fit and transform the imputer itself.
CalculateStrategy - The interface for the actual calculation strategies. It ensures that the mean, mode, and median are all calculated through the same method.
AxisStrategy - The interface for selecting data along an axis, so that values are read by column rather than by row.
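To illustrate what the abstract interfaces enforce, here is a minimal sketch. The `Smallest` and `Incomplete` classes are hypothetical and exist only for this example; they are not part of the project. The point is that a subclass can only be instantiated once it implements every abstract method.

```python
from abc import ABC, abstractmethod


class CalculateStrategy(ABC):
    @abstractmethod
    def calculate(self, data):
        """Return one fill value for a list of numbers."""


class Smallest(CalculateStrategy):  # hypothetical strategy, for illustration only
    def calculate(self, data):
        return min(data)


class Incomplete(CalculateStrategy):  # hypothetical: forgets to implement calculate()
    pass


# A complete strategy can be created and used through the interface
smallest_value = Smallest().calculate([3, 1, 2])

# An incomplete strategy cannot even be instantiated
try:
    Incomplete()
    instantiated = True
except TypeError:
    instantiated = False

print(smallest_value, instantiated)
```

This is why the interfaces are useful even though Python does not require them: a missing method fails early, at construction time, instead of later in the middle of an imputation.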

After establishing all the interface classes, we can proceed to implement the logic behind each interface.

To make things easier for myself, I first created the logic for the axis, so that the code only looks through columns instead of rows.

Side note: for my future self (in case I need to use this program again), I also created a placeholder strategy for rows.

class Axis0(AxisStrategy):  # Axis 0: column-wise selection only, not rows
    def __init__(self, data):
        self._data = data

    def select(self, column_index):
        # Collect the values of one column, skipping the "nan" placeholders
        return [row[column_index] for row in self._data if row[column_index] != "nan"]

class Axis1(AxisStrategy):  # Axis 1: row-wise selection; placeholder, not needed here

    def select(self, data):
        pass

In this code, Axis0 collects the non-missing values of a given column, while Axis1 is reserved for doing the same across rows and is left unimplemented.
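As a rough sketch of how the two selectors could behave, the example below mirrors the column logic of Axis0 in a standalone `ColumnSelector` class and adds a `RowSelector`. Both names are hypothetical, and `RowSelector` is only one possible way a row-wise strategy could look; it is not the project's Axis1.

```python
class ColumnSelector:  # mirrors the column logic of Axis0
    def __init__(self, data):
        self._data = data

    def select(self, column_index):
        # Values of one column, skipping the "nan" placeholders
        return [row[column_index] for row in self._data if row[column_index] != "nan"]


class RowSelector:  # hypothetical row-wise counterpart, for illustration only
    def __init__(self, data):
        self._data = data

    def select(self, row_index):
        # Values of one row, skipping the "nan" placeholders
        return [value for value in self._data[row_index] if value != "nan"]


sample = [
    ["France", 44.0, 72000.0],
    ["Germany", 40.0, "nan"],
    ["Spain", "nan", 52000.0],
]

salaries = ColumnSelector(sample).select(2)
second_row = RowSelector(sample).select(1)
print(salaries)
print(second_row)
```

Note that a row in this dataset mixes text and numbers, so a row-wise mean or median would also need to filter out non-numeric values first.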

Next comes the logic for the calculation. Following the interface established earlier, three different classes inherit from the “CalculateStrategy” class. The code is as follows:

import statistics as s  # Standard-library module that provides mean, mode and median

class Mean(CalculateStrategy):

    def calculate(self, data):
        return s.mean(data)        # Returns the arithmetic mean of the values

class Mode(CalculateStrategy):

    def calculate(self, data):
        return s.mode(data)        # Returns the most common value

class Median(CalculateStrategy):

    def calculate(self, data):
        return s.median(data)      # Returns the middle value

All the functions used here come from Python's “statistics” library, which made it very convenient to implement the necessary imputations.
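To make the difference between the three calculations concrete, here is a small sketch that applies the `statistics` functions to values taken from the dataset used later in this post. The variable names are my own for this example.

```python
import statistics as s

# Known ages and all countries from the example dataset
ages = [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 48.0, 50.0, 37.0]
countries = ["France", "Spain", "Germany", "Spain", "Germany",
             "France", "Spain", "France", "Germany", "France"]

mean_age = s.mean(ages)           # sum of the values divided by their count
median_age = s.median(ages)       # middle value once sorted
mode_country = s.mode(countries)  # most frequent value; also works on text

print(round(mean_age, 2))  # 38.78
print(median_age)          # 38.0
print(mode_country)        # France
```

One detail worth knowing: when several values are tied for most frequent, `statistics.mode` returns the first one encountered on Python 3.8 and later, while older versions raise `StatisticsError`.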

And finally, to tie it all together, a separate Imputer class was created. This class is responsible for combining all the previously established logic and producing the end result.

class Imputer:
    def __init__(self, strategy: str = "mean", axis: int = 0):
        self._strategy = strategy
        self._axis = axis
        self._ImputeValues = None   # Fill value computed by fit() (mean/mode/median)
        self._axis_strategy = None
    
    def fit(self, x, data):
        # Pick the axis strategy: axis 0 = columns, axis 1 = rows
        self._axis_strategy = Axis0(data) if self._axis == 0 else Axis1()
        column_data = self._axis_strategy.select(x)

        #Strategy selection
        if self._strategy == "mean":
            calculator = Mean()
        elif self._strategy == "mode":
            calculator = Mode()
        elif self._strategy == "median":
            calculator = Median()
        else:
            raise ValueError("Only accepts mean/mode/median")
        
        self._ImputeValues = calculator.calculate(column_data)

    
    def transform(self, column_index, data):
        for row in data:
            if row[column_index] == "nan":
                row[column_index] = self._ImputeValues

        return data
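The if/elif chain in fit works fine for three strategies. As one possible alternative, the sketch below looks the strategy up in a dictionary instead. The `CALCULATORS` registry and the `make_calculator` helper are hypothetical names that are not in the project, and the three classes are simplified stand-ins (without the abstract base class) to keep the example short.

```python
import statistics as s


class Mean:
    def calculate(self, data):
        return s.mean(data)


class Mode:
    def calculate(self, data):
        return s.mode(data)


class Median:
    def calculate(self, data):
        return s.median(data)


# Hypothetical registry: strategy name -> strategy class
CALCULATORS = {"mean": Mean, "mode": Mode, "median": Median}


def make_calculator(name):
    """Return a calculator instance for the given strategy name."""
    try:
        return CALCULATORS[name]()
    except KeyError:
        raise ValueError("Only accepts mean/mode/median") from None


print(make_calculator("median").calculate([1, 5, 2]))  # 2
```

With this layout, adding a new imputation method would only mean writing one class and adding one dictionary entry.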

Result

Consider a dataset such as the following:

data = [
    ['France', 44.0, 72000.0],
    ['Spain', 27.0, 48000.0],
    ['Germany', 30.0, 54000.0],
    ['Spain', 38.0, 61000.0],
    ['Germany', 40.0, 'nan'],
    ['France', 35.0, 58000.0],
    ['Spain', 'nan', 52000.0],
    ['France', 48.0, 79000.0],
    ['Germany', 50.0, 83000.0],
    ['France', 37.0, 67000.0]
]

In this dataset, missing values are simulated with the string 'nan'. Running the code on it shows how the imputer works, step by step.

Initialize the Imputer.

imputer = Imputer("mean", 0)

Fit the imputer on the column it should work on (here, column index 2).

imputer.fit(2,data)

Perform the actual transformation.

transformed_data1 = imputer.transform(2,data)

The final output is as demonstrated: Dataset Screenshot
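The steps above can be cross-checked without the classes. The sketch below recomputes the mean of column 2 by hand and fills the gap. It is an approximation of what fit and transform do, not the project's code. It also copies the rows first, an assumption on my part, because the original transform edits the rows it is given in place.

```python
import statistics as s

data = [
    ['France', 44.0, 72000.0],
    ['Spain', 27.0, 48000.0],
    ['Germany', 30.0, 54000.0],
    ['Spain', 38.0, 61000.0],
    ['Germany', 40.0, 'nan'],
    ['France', 35.0, 58000.0],
    ['Spain', 'nan', 52000.0],
    ['France', 48.0, 79000.0],
    ['Germany', 50.0, 83000.0],
    ['France', 37.0, 67000.0]
]

column_index = 2

# Copy each row so the original list of lists stays untouched
filled = [list(row) for row in data]

# "fit": mean of the known values in the chosen column
known = [row[column_index] for row in data if row[column_index] != "nan"]
fill_value = s.mean(known)

# "transform": replace the placeholder with the fitted value
for row in filled:
    if row[column_index] == "nan":
        row[column_index] = fill_value

print(filled[4])  # salary is now the column mean, about 63777.78
print(data[4])    # the original row still holds 'nan'
```

Only column 2 is filled here. The 'nan' in the age column (index 1) stays until the imputer is fitted and applied to that column as well.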

The full project can be seen on my GitHub.

Python Imputer
