How to read from an excel file and compute the statistical median
Reading from an excel file provides a tool for data analysis and reporting like no other. This article attempts to demystify the same and provide a simple code abstract which achieves the same. It uses C# language constructs and simple arithmetic to achieve the goal.
I thought of an interesting problem that I had encountered on a problem solving site as the subject of this article.
The problem statement is as follows :
"Given a raw microsoft excel file (.xlsx - office 2007 and above) how do we read data from it and compute a statistical median from the same".
Firstly what are the "variables" here that determine our solution strategy.
a) We need to read the excel file
b) This is in an .xlsx format (typical to microsoft office)
c) We should understand what a median is and how to compute it given a set of data
Let us analyze the solution part by part -
For a) We need to use an appropriate OLEDB driver that can read the file.
For b) We choose the ACE driver. (The older Jet engine is not suitable for this purpose, as it does not support the xlsx format).
For c) By definition, the median is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from
the lower half.
This brings about two natural scenarios.
- The number of data items is even.
- The number of data items is odd.
In the second case, the median is simply the mid-point of the data set.
In the first case it is the average of the two items that make up the mid-points of the data set.
In the illustrated code, the sample is made up of even number of data items.
Comments have been provided where required.
This article is suitable for beginners and is easy to understand.
ArrayList ar = new ArrayList(); //remember to provide a reference to System.Collections
//Use the ACE driver. This works with the xlsx file format and hence is suitable for the purpose
string con = @"Provider=Microsoft.ACE.OLEDB.12.0;;Data Source=D:\temp\median.xlsx;Extended Properties='Excel 12.0;HDR=Yes;'";
using (OleDbConnection connection = new OleDbConnection(con)) //define the scope
{
connection.Open(); //open a connection to the data source
OleDbCommand command = new OleDbCommand("select * from [Sheet1$]", connection);
using (OleDbDataReader dr = command.ExecuteReader()) //use a reader as it is fast
{
while (dr.Read()) //loop through the data
{
ar.Add(int.Parse(dr[0].ToString())); //add data items to the array list
}
}
}
ar.Sort(); //sort the arraylist
double ar_median = 0;
double ar_low = 0;
double ar_high = 0;
int lo = 0;
int hi = 0;
int arlo = 0;
int arhi = 0;
if ((ar.Count % 2) == 0) //number of data items are even
{
ar_low = Math.Floor((double)(ar.Count + 1) / 2);
ar_high = Math.Ceiling((double)(ar.Count + 1) / 2);
}
else //number of data items are odd
{
ar_low = Math.Floor((double)(ar.Count) / 2);
ar_high = Math.Ceiling((double)(ar.Count) / 2);
}
lo = int.Parse(ar_low.ToString())-1; //lo is an index in the arraylist which is zero based
hi = int.Parse(ar_high.ToString())-1; //hi is an index in the arraylist which is zero based
arlo = int.Parse(ar[lo].ToString());
arhi = int.Parse(ar[hi].ToString());
double arav = Convert.ToDouble(arlo + arhi);
ar_median = Math.Round(arav / 2); // even number of data items
MessageBox.Show(ar_median.ToString()); //result
That's all there is to it. The ramifications of this are mind-boggling. You can read data, parse data or even use it to generate graphs and charts
using third party controls in .net.