How to read from an excel file and compute the statistical median

Reading from an excel file provides a tool for data analysis and reporting like no other. This article attempts to demystify the same and provide a simple code abstract which achieves the same. It uses C# language constructs and simple arithmetic to achieve the goal.

I thought of an interesting problem that I had encountered on a problem solving site as the subject of this article.

The problem statement is as follows :

"Given a raw microsoft excel file (.xlsx - office 2007 and above) how do we read data from it and compute a statistical median from the same".

Firstly what are the "variables" here that determine our solution strategy.

a) We need to read the excel file

b) This is in an .xlsx format (typical to microsoft office)

c) We should understand what a median is and how to compute it given a set of data

Let us analyze the solution part by part -

For a) We need to use an appropriate OLEDB driver that can read the file.

For b) We choose the ACE driver. (The older Jet engine is not suitable for this purpose, as it does not support the xlsx format).

For c) By definition, the median is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from

the lower half.

This brings about two natural scenarios.
- The number of data items is even.
- The number of data items is odd.

In the second case, the median is simply the mid-point of the data set.
In the first case it is the average of the two items that make up the mid-points of the data set.

In the illustrated code, the sample is made up of even number of data items.

Comments have been provided where required.

This article is suitable for beginners and is easy to understand.

ArrayList ar = new ArrayList(); //remember to provide a reference to System.Collections

//Use the ACE driver. This works with the xlsx file format and hence is suitable for the purpose
string con = @"Provider=Microsoft.ACE.OLEDB.12.0;;Data Source=D:\temp\median.xlsx;Extended Properties='Excel 12.0;HDR=Yes;'";
using (OleDbConnection connection = new OleDbConnection(con)) //define the scope
connection.Open(); //open a connection to the data source
OleDbCommand command = new OleDbCommand("select * from [Sheet1$]", connection);
using (OleDbDataReader dr = command.ExecuteReader()) //use a reader as it is fast
while (dr.Read()) //loop through the data
ar.Add(int.Parse(dr[0].ToString())); //add data items to the array list

ar.Sort(); //sort the arraylist

double ar_median = 0;
double ar_low = 0;
double ar_high = 0;

int lo = 0;
int hi = 0;

int arlo = 0;
int arhi = 0;

if ((ar.Count % 2) == 0) //number of data items are even
ar_low = Math.Floor((double)(ar.Count + 1) / 2);
ar_high = Math.Ceiling((double)(ar.Count + 1) / 2);
else //number of data items are odd
ar_low = Math.Floor((double)(ar.Count) / 2);
ar_high = Math.Ceiling((double)(ar.Count) / 2);

lo = int.Parse(ar_low.ToString())-1; //lo is an index in the arraylist which is zero based
hi = int.Parse(ar_high.ToString())-1; //hi is an index in the arraylist which is zero based

arlo = int.Parse(ar[lo].ToString());
arhi = int.Parse(ar[hi].ToString());

double arav = Convert.ToDouble(arlo + arhi);

ar_median = Math.Round(arav / 2); // even number of data items

MessageBox.Show(ar_median.ToString()); //result

That's all there is to it. The ramifications of this are mind-boggling. You can read data, parse data or even use it to generate graphs and charts

using third party controls in .net.


No responses found. Be the first to comment...

  • Do not include your name, "with regards" etc in the comment. Write detailed comment, relevant to the topic.
  • No HTML formatting and links to other web sites are allowed.
  • This is a strictly moderated site. Absolutely no spam allowed.
  • Name: