Excel Spreadsheet Data Entry Tips
Prepared by the Center for Biostatistics, Department of Biomedical Informatics
The Ohio State University
Revised 04/15/2015
Note:
The following document presents some general tips and guidelines for research data entered
into Excel. These tips are designed to avoid common problems and minimize the time spent
‘cleaning’ the data before actual analysis.
While these rules cover some basic aspects of data entry, we strongly recommend consulting
with a statistician before starting to collect your data.
Excel is currently a popular choice for data entry/management for small research projects.
However, Excel may not necessarily be the best/most efficient way of recording your data; a
statistician can recommend other choices which may be better suited to your particular needs.
This document was prepared by the Center for Biostatistics. Please contact us if you have
questions/comments about this guide or would like to speak with a statistician regarding your
particular research project.
Center for Biostatistics:
Website: http://www.biostatistics.osu.edu
Email: biostatistics@osumc.edu
Phone: (614) 293-6899
Page 1
Excel Spreadsheet Data Entry Tips
Prepared by the Center for Biostatistics, Department of Biomedical Informatics
The Ohio State University
Topic
Page(s)
1. Variable Names
2
2. ID Variable
3
3. General Data Entry Rules
3-4
4. Missing Data
5
5. Multiple Responses
5
6. Dates
5
7. Calculated Data
6
8. Multiple Measures Per Subject
6
9. Survival Data
7
10. Multiple Datasets
8
11. Avoiding Common Sheet ‘Pitfalls’
9-10
12. Example: Spreadsheet with Some Common Data Issues
11-12
Revised: 04/15/2015
Excel Spreadsheet Data Entry Tips Prepared by the Center for Biostatistics
Revised: 04/15/2015 http://www.biostatistics.osu.edu
Page 2
Excel Spreadsheet Data Entry Tips:
1. Variable Names:
Enter variable names in the first row of the spreadsheet.
Do not put spaces in the name. Use the underscore “_” character instead (e.g.,
body_weight” instead of “body weight).
Keep variable names simple and short (e.g., ‘pt_weightor ‘bodywtinstead of
patient_body_weight_in_kg_measured_at_baseline’)
You can create a key to put more information about each variable on a separate
sheet if desired:
The first character of the variable name should be a letter, not a number (e.g.,
week1’ instead of ‘1st_week).
No special characters (e.g., !, @, *) in the name.
Make sure each variable name is unique. Do not use merged cells to differentiate
variables.
So instead of
this…
Arrange your
data like this…
The variable names
dx_date’ and
Medications’ are
repeated in the 2
nd
row.
units of
measurement (e.g.,
kg, years) in the
description where
applicable!
Excel Spreadsheet Data Entry Tips Prepared by the Center for Biostatistics
Revised: 04/15/2015 http://www.biostatistics.osu.edu
Page 3
2. ID Variable:
Always include an ID variable on each sheet of your workbook so that variables are
properly associated with each subject.
o Note that the ID needs to be unique for each subject in your study!
Do not use MRN, patient names, social security numbers, or any other identifying information
that would violate HIPAA rules as ID variablescreate your own study ID instead and be sure to
keep a key so that you can match the study ID with the original identifying information.
3. General Data Entry Rules:
One row per subject (please see ‘8. Multiple Observations Per Subject’ for
exceptions).
One column per variable.
One value per cell (please see ‘5. Multiple Responses’ for more information).
o Special case: values composed of multiple components, such as blood
pressure
Avoid text for values if possible use numbers instead (e.g., 0 for Male and 1 for
Female).
If you use text values, be careful about spelling/capitalization!!
o In our statistical programs, ‘Male’ is not the same as ‘male’ or ‘Muse only
one form in your data entry and be consistent!
Instead of putting both systolic and diastolic pressures in one
column, create separate columns for each component.
NO
YES
Excel Spreadsheet Data Entry Tips Prepared by the Center for Biostatistics
Revised: 04/15/2015 http://www.biostatistics.osu.edu
Page 4
If you do use numbers to represent text, we recommend creating a key on a
separate sheet:
Any extra text or notes should go in a separate column, not within the variables
themselves.
For numeric variables, please be consistent with the units:
In this example, the unit of measurement is ‘months’. Therefore, we want to make
sure that all entries are measured in months (not weeks, days, or any other units of
time).
Do not use the following to organize your data:
o Color coding
o Merged cells
o Blank rows/columns
NO
YES
NO
YES
Be sure that the
variable names in
your key match
the names in the
spreadsheet!
Excel Spreadsheet Data Entry Tips Prepared by the Center for Biostatistics
Revised: 04/15/2015 http://www.biostatistics.osu.edu
Page 5
4. Missing Data
Leave blank or code with an identifier that does not match any other numerical
value entered (e.g., -9999).
Do not use text to represent missing data, especially if your variable is numeric:
5. Multiple Responses
When a question/variable has multiple responses that are not mutually exclusive,
we recommend you create separate variables for each response.
For example, suppose we have a variable Medslisting all of the medications a
patient was taking:
6. Dates
Please use MM/DD/YYYY format (e.g., 12/12/2014).
Be consistent when entering dates, particularly with the 4 digit year! (e.g., do not put both
12/12/2014 and 12/12/14)
NO
NO
YES
Instead of having multiple responses in theMeds” variable separated by
commas, create separate variables for each possible response. Code each
variable as 1 = yes or 0 = no.
Excel Spreadsheet Data Entry Tips Prepared by the Center for Biostatistics
Revised: 04/15/2015 http://www.biostatistics.osu.edu
Page 6
7. Calculated Data
We always prefer the original (‘raw’) data over calculated totals, formulas,
‘normalized’ data, etc.
o For example, please provide actual dates instead of calculated days between
measurements, raw CT values rather than fold changes.
We can calculate these quantities easily in our statistical programs.
8. Multiple Observations Per Subject
When subjects have multiple observations (e.g., time points, replicates, etc.), we
generally prefer that the data are arranged so that there are multiple rows per
subject, one row for each observation.
Be sure to repeat the ID value as well as any other variables that are associated with
the subject that remain constant (e.g., race, gender).
So instead of this…
Arrange your data
like this…
Please talk with your statistician about your particular situation before entering data!
One row per patient,
each measurement in
a separate column
(‘
Week1
– ‘
Week3
’)
One row for each
observation. All
measurements are in one
column (Measurement),
with another column
(‘Week’) to identify the
week. Note that the ID,
age, and gender variables
need to be filled in for each
observation.
Excel Spreadsheet Data Entry Tips Prepared by the Center for Biostatistics
Revised: 04/15/2015 http://www.biostatistics.osu.edu
Page 7
9. Survival Data
Do not provide summary data by time point!
For each subject, provide the following information:
o Start date/time
o End date/time this corresponds to either the date/time the subject had the
event of interest, or the last date/time of the study.
o Status at end date/time:
1 = died/had the event of interest
0 = censored/did not have the event of interest
If your survival data are more complicated (e.g., you want to look at overall survival
as well progression-free or disease-free survival, or want to consider competing
risks), be sure to talk with your statistician about the best way to record the
information:
o Dates are always preferred over calculated times.
o When possible, each event/time should be put in separate columns.
In this example, both overall survival (OS) and progression free survival (PFS) are of
interest. Note that we have separate columns time and an indicator for each event.
Excel Spreadsheet Data Entry Tips Prepared by the Center for Biostatistics
Revised: 04/15/2015 http://www.biostatistics.osu.edu
Page 8
10. Multiple Datasets
Only include one dataset per sheet. Do not put unrelated sets of data on the same
page.
Exception: If you are collecting the same information in various datasets (e.g.,
running the same experiment over different time points/batches, collecting the
same information in different treatment groups), you can arrange the data on one
sheet. In this case, please do not put the data in ‘blocks’:
So instead of
this…
Arrange your
data like
this…
In this example, the same
experiment was run in
batches on three
different dates. Note
how the data are
grouped in ‘blocks’
according to the date of
the experiment.
Therefore, values for
Mouse_ID’,
Treatment’, and
Measurement’ are
contained in more than
one column.
All three batches are now
combined, with one column
for each variable. Each
mouse has an ‘Exp_date
value to identify group
membership.
Excel Spreadsheet Data Entry Tips Prepared by the Center for Biostatistics
Revised: 04/15/2015 http://www.biostatistics.osu.edu
Page 9
11. Avoiding Common Sheet Pitfalls
Do not include any plots/figures on your data sheet put them on a separate sheet.
Do not include notes or summary statistics (e.g., means, standard deviations) next to
or below your data on the same sheet!
Do not put variable descriptions or information about the values of a variable in the
same cell as the variable name/header.
Do not repeat headers throughout the worksheet.
NO
NO.
Please see ‘1. Variable Names’, page 2 and
‘3. General Data Entry Rules’, page 4 for
examples of how to include extra
information about a variable.
NO
Excel Spreadsheet Data Entry Tips Prepared by the Center for Biostatistics
Revised: 04/15/2015 http://www.biostatistics.osu.edu
Page 10
If the variable is numeric, do not use ‘<’ or ‘>.
NO
YES
Instead of ‘<100’, replace with
the lower bound of 100 in this
example. Talk with your
statistician about what makes
clinical sense for your data!
Excel Spreadsheet Data Entry Tips Page 11
Prepared by the Center for Biostatistics (http://www.biostatistics.osu.edu) Revised: 04/15/2015
Example: Spreadsheet with Some Common Data Issues
The following hypothetical spreadsheet would require extensive data management before analysis.
Location/Site Patient Subject Number
Race Gender Comorbidities White Blood Cell Count Length of Stay date
Complications
1 C
M
2 9.6
3
3/26/14
X
2 C M 3 6.7 6 11/24/2013
X
3 AA male 1
12.2
13
2/8/2014
4 A A female 1, 2 7.8 8 3/19/2014 x
5 O
m 3 8.3 3 wks UNKNOWN x
6 AA F 2 6.4 13 2/1/2014
7 AA
F
1,3 4.9
9
4/13/2014
8 C M 2 <5 2 9/14/2013
9 O - 2
10.4
6
12/5/2013
10 C M 1 8.3 8 3/2/2014
11 C
M 2 <5 5 wks 4/2/2014 X
12 C F 2,3 11.2 7 10/30/2013 X
13 AA
F
3 7.3
1
1/19/2014
X
14 AA F 2 10.4 2 1/5/14
X
15 c M 3 8.7 6 2/27/2014 X
16 C F 3 9.6 18 9/17/2013
X
17 C M 1 5.5 15 11/8/2013
18 O F 1,2,3 8.8 6 10/18/2013
19 O M 3 5.7 4 2/19/2014
20 AA f 2 9.7 9 1/24/2014 X
Site 1
Site 2
Site 3
Use of merged cells to
group observations by
site.
Multiple responses
in one variable,
separated by
commas.
Use of text in numeric
variables (the ‘<sign
for WBC and ‘wks’ for
length of stay).
Using text to denote Race
and Gender in an
inconsistent way (e.g., ‘AA’
or ‘A A’, both upper and
lowercase, ‘M’ and ‘male’).
Use of a dash (‘-‘) to
denote missing gender.
Inconsistent dates
(both 2 digit year
and 4 digit year
used), text
‘UNKNOWN’ used.
Blank cells: do they
indicate no
complications, or
missing data?
Both ‘x’ and ‘X’
used to denote
complications.
Color-coding
observations
to denote
groups.
Special
characters
(‘/’), spaces in
the variable
names. Some
variable names
are rather
long.
Excel Spreadsheet Data Entry Tips Page 12
Prepared by the Center for Biostatistics (http://www.biostatistics.osu.edu) Revised: 04/15/2015
Example: Spreadsheet with Data Issues – FIXED!
The following hypothetical spreadsheet is now ready for analysis by a statistician.
site patient_ID group race gender comorbid_1 comorbid_2 comorbid_3 WBC LOS date complications
1 1 1 C M 0 1 0 9.6 3 3/26/2014 1
1 2 1 C M 0 0 1 6.7 6 11/24/2013 1
1 3 1 AA M 1 0 0 12.2 13 2/8/2014 0
1 4 2 AA F 1 1 0 7.8 8 3/19/2014 1
1 5 2 O M 0 0 1 8.3 21
1
1 6 3 AA F 0 1 0 6.4 13 2/1/2014 0
2 7 2 AA F 1 0 1 4.9 9 4/13/2014 0
2 8 2 C M 0 1 0 4.9 2 9/14/2013 0
2 9 2 O
0 1 0 10.4 6 12/5/2013 0
2 10 1 C M 1 0 0 8.3 8 3/2/2014 0
2 11 3 C M 0 1 0 4.9 35 4/2/2014 1
2 12 3 C F 0 1 1 11.2 7 10/30/2013 1
2 13 3 AA F 0 0 1 7.3 1 1/19/2014 1
2 14 3 AA F 0 1 0 10.4 2 1/5/2014 1
2 15 1 C M 0 0 1 8.7 6 2/27/2014 1
2 16 1 C F 0 0 1 9.6 18 9/17/2013 1
2 17 2 C M 1 0 0 5.5 15 11/8/2013 0
3 18 2 O F 1 1 1 8.8 6 10/18/2013 0
3 19 2 O M 0 0 1 5.7 4 2/19/2014 0
3 20 2 AA F 0 1 0 9.7 9 1/24/2014 1
Instead of merged
cells, each patient has
site listed individually.
Race and Gender
entries are now
consistent (for
example, ‘C’ alone
denotes Caucasian).
A blank cell is used
to indicate missing
data.
Instead of multiple
responses in one variable,
created separate 0/1
variables for each possible
response (0=does not have
comorbidity, 1 = has
comorbidity).
Removed all text
from numeric
variables. Replaced
‘<5’ with ‘4.9’ after
discussions between
investigator and
statistician.
Replaced color-
coding with a
new variable to
identify group.
Dates are all now
of the form
MM/DD/YYYY.
Replaced
‘UNKNOWN’ with
blank cell.
Changed to 0/1
variable instead of text.
Patients without
complications are now
denoted as ‘0’ to avoid
being confused with
missing data.
Short, concise variable
names with no spaces
or special characters.
Use of underscore (‘_’)
to separate words.