Where we're going with all of this
The goal of this course is to get you from zero to done with Excel (or Excel for Mac, or Google Sheets).
​
Excel accepts human data - messy, chaotic, repetitive, full of random spaces, contradictory, misspelled; it captures the data as it comes. It won't demand any pre cleaning or organisation; it provides the space to figure that all out. Excel lets you put notes, musings or song lyrics wherever you want without breaking.
​
Try doing that with Access, SQL or Tableau and see how far you get...
​
Given how flexible Excel is, working with it involves more than just 'copy this paste it here'. Anyone can do that - hell even AI sometimes manages to do it (results vary). The more important part is being able to think about what the data is telling you and what it may tell you in the future.
​
Each module will have a quantitative part to it which shows how to resolve the qualitative problem posed.
The rest of this module assumes you know what question you are trying to answer with your data, so if the answer is 42 that will mean something.
​​
Questions can be anything from straightforward factual to half fact, half opinion:
-
What is the average number of pastries we sell per month?
-
Should we close one of our 4 locations?
-
What company should I invest in?
-
What is the average grade our students achieved on this particular test and what, if anything, can we learn from analysing that further?​
The module also assumes you know the obvious data you need, and where to get them. For example if you are building a cashflow, I assume you know you'll need operating cash flow, depreciation and amortisation etc. If you don't, Google!
​
Now onto the focus of this module: material missing data.
Qualitative: What material data is missing?
If this was an easy answer, the job advert wouldn't have asked for 'advanced Excel skills'. Even worse, the 'answer' is it depends. You'll need to apply your nouse to your own situation but here are some surprises that I've found that help generate questions specific to your situation:
​
Theory: Google Analytics captures all of your web traffic.
Nope - things like ad blockers, privacy settings, certain browsers, technical issues all get in the way of perfect information.
Theory: Feedback ratings match customer experiences.
Nope - When is the last time you wrote feedback for something that was 'fine' like drain cleaner or the 6th reorder of tissues?
​
And finally - perhaps the best example - is Abraham Wald's contribution to the Allies in WWII. The prevailing opinion was that the parts of the airplane that were shot the most should be reinforced. He pointed out that only the planes that made it back were analysed, which meant that the shots in those planes didn't down the aircraft. It stood to reason then that the planes that didn't return were shot in other areas, so those other areas should be reinforced. It's worth a Google for the full story!
​​
​
Okay I get the theory, so what do I do about the problem?
​​
-
Make a list of known unknowns - the data you know you won't be able to get, and why not.
-
Determine whether you'll be able to put in any pro forma data to capture what is missing (can you infer from other data what is missing?)
-
Decide where you'll disclose what you know isn't there, in an obvious and visible way. No hiding it in footnote 27.
-
*
​
Sometimes, you won't notice data missing until you start working with it - and worse, you may not notice right away and build something which has to be redone. Avoiding that situation is what the rest of this module covers.
​
We won’t cover these in detail but if you’re looking at building something more involved or longer term, consider:
-
Repeatability - can you acquire this data going forward or have you promised your firstborn child to a colleague for a one off run?
-
Regulations - anything specific to your industry or jurisdiction that could either help you source data (hi MIFID!) or require you to process it in a different way?
Quantitative: Trust no [data]
​​Finally you received your data from Jane in Accounts and you're ready to build. ​This is where the best laid plans make contact with reality... data is wrong and incomplete ALL OF THE TIME.
​
Let me say that again for folks in the back - data is wrong and/or incomplete ALL OF THE TIME.
​​
There is usually nuance to this - 'wrong' may mean it's not what you expected; for example a spreadsheet could show a customer purchase as a negative (because it's reflecting the decrease in available inventory) but you are looking at it from a revenue perspective so expect it to be positive.
​
Unfortunately wrong can also mean straight up inaccurate - either because of data entry mistakes, data maintenance mistakes, some sort of corruption, or gremlins (technical term meaning no one knows how the f*@!& this happened).
​
These issues are sometimes easy to spot if the differences are large enough to be obvious, but can creep through your analysis if you haven't checked.
​
Sometimes you just have to laugh... because screaming F*&@ in the office is frowned upon
Here are a few real life examples of what I've been sent in a professional context, billed as an appropriate response (identifying details removed to protect the guilty).
I find it least disappointing to assume the source of the data has not done any checks to make sure the data is what you need, is usable/or at least predictably unusable. I also assume they never clean up any of their data or check that an overenthusiastic intern hasn't accidentally overridden years of data in the early 2000s.
​
The most common, and time wasting, issues I see on a regular basis are:
​
Unit & Currency
​
The point here is consistency. Regardless of what your report shows you, either in number of units sold (vs items vs boxes) or in currency, there must be a clear label of what exactly it is. If not, it's easy to guess wrong - and then have to redo it all.
Click here to download the workbook from the video. Yellow cells have formulas, green cells are where users would normally input data.
Does Excel know they are dates and numbers?
​
As covered in the Primer, Excel understands how to operate with dates and numbers. That is, if it knows what you gave it is a number or a date.
​
A few things to check:
​
-
Are your dates American style or European style? If you do not have any dates from later in the month, ASK.
-
Can you reformat them to, say, 3 Mar 2024? ​​
​
If not, you can use this formula or a simple variant of it to convert them - which as a bonus means your PivotTables can group by month, quarter and year (if you have that data) and your chart axes labels can be labeled in a human friendly way. A1 is where your faux date is, and it looks like this 16/05/2024.
​
=date(right(A1,4),mid(A1,4,2),left(A1,2))
​
With number entries, check whether when you highlight the range, you get a total in the lower right hand corner of your screen. If not, double click into a cell to see what is going on. Sometimes it's a leading or trailing space; use the TRIM() function to sort it. If it's a comma, use the find and replace function to replace every comma with nothing (aka removing all commas).
Negatives, blanks, leading zeros
​
Do you have negative data where you expect only positive values? Investigate.
What does a blank mean in your data? Does it mean zero sales? Does it mean you are missing data (a team member hasn't filled in their report)? Is this cell filled with spaces so you can't sum the column? (yes, that happens....)
​
Leading zeros is a funny one that usually happens with client identifiers whereby an accounting system assigns an account with the number 00023. Possibly they're the 23rd unique client that was added to the database, who knows what they were doing 25 years ago when they put the system in.
​
Problem is when you bring that into Excel, it will assume you mean 23 and chop off all of the zeros. Sometimes that is useful but it becomes a problem when you need those zeros to differentiate it from 023.
Quantity has a quality of its own
Or rather, certain parts of Excel have a functional limit, like filters, sorting, copy/paste, delete rows...
​
If you have a few hundred lines you're likely fine.
​
10s of thousands? One of two things can happen.
Either:
-
For filters, the stupid sign pops up in your dropdown saying 'not all results are displayed' [due to too many results]. At that point I don't use the filters anymore anyway as I've found the results are inconsistent.
-
The function (filter, sort, copy/paste, delete rows) is overwhelmed, can't load, or takes *@£&$@£ ages to load and crashes anyway.
​
This then becomes another good example of achieving the same result in different ways in Excel. Instead of filtering, I'll use formulas to determine which rows I don't want.
​
In this scenario, I'm evaluating the best location for an electronic billboard campaign for domestic electric charge points in the US state of Washington. I want to know where most people who own an electric vehicle live, which I can get from their vehicle registration.
​
In this data set of Washington State vehicle registrations, there are entries that could skew my results away from what I want so I need to get rid of them.
​
-
For example I want to disregard any rental fleet vehicles (Hertz, Enterprise Rent A Car and the like), or company fleet vehicles (those that are returned to a company garage at the end of the work day). For our purposes, this will be any vehicle with a vehicle ID between 100000000 and 200000000.​​
This file is particularly large so it may take some time to download.
I also want to disregard any vehicle that can drive less than 30 miles on the electric battery. When I go to do some checks on my data to make sure it is what I expect, I found a problem.
(Sidenote - this data came from data.gov - and while we can all snark about the government's inability to function, there is some expectation of the data being relatively clean and usable. This falls short.)
​
Washington State's 'Electric Range' is crap; for example the Tesla Model 3 is listed at 0 miles. ​What can we do about this? It depends; to know how to proceed we need more information.
​
Enter, the PivotTable.
​​
Pivot Tables, like VLOOKUPs have an undeserved reputation as monsters. All they are, are the drag and drop version of SUMIF and SUMIFs that we covered in the Primer.
​
Instead of writing formulas you can just drag and drop columns? Sounds faster and less faffy!
​
And it is, so long as you work within the structure that the PivotTable provides.
(same download workbook as above)
​
​If you wanted to do this with a set of formulas, you'd first need a unique list of cities then depending on what other criteria you were looking at, you'd need a unique list of them, then you'd create your formula and drag it down and across. You would be best off saving as values to sort, which defeats the flexibility provided by using formulas in this case.
​
With practice you'll have a good idea of what to use for each data task you come across; and even if something goes wrong it is easy to pivot (pun somewhat intended) to something else!
Learned something? Leave a tip!
Help keep this course free by leaving a tip if you found the content on this page helpful, timesaving, nice to know, funny, or otherwise feel it is worth keeping online. The average contributor pitches in £5 - £10 per module, and if you need an invoice or receipt to submit to your company for continuing professional development drop me a line and I'll sort it out.