The Metropolitan Museum of Art, New York
The first step in a good data hygiene practice is to codify your file structure and naming conventions. To do this, you will want to:
Create a uniform naming convention for your files and stick to it!
Be descriptive: don’t assume that the next researcher, or even future you, will know what those abbreviations meant.
Keep Folder Trees and File Formats consistent.
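One way to stick to a naming convention is to let a small script build the names for you. The following is a minimal Python sketch, not a prescribed standard: the pattern (project, description, date, version) and the function name build_filename are assumptions made for illustration, so adapt them to whatever convention you settle on.

```python
import re
from datetime import date


def build_filename(project: str, description: str, when: date,
                   version: int, extension: str) -> str:
    """Build a name following a hypothetical project_description_date_version pattern."""

    def slug(text: str) -> str:
        # Lowercase the text and replace every run of characters other than
        # a-z and 0-9 with a single hyphen. Note that accented letters are
        # replaced too, which may not suit every project.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # ISO dates (YYYY-MM-DD) sort chronologically when sorted alphabetically.
    return (
        f"{slug(project)}_{slug(description)}_{when.isoformat()}"
        f"_v{version:02d}.{extension.lower().lstrip('.')}"
    )


print(build_filename("Letters", "Box 12, Folder 3", date(2024, 3, 15), 2, "JPG"))
# letters_box-12-folder-3_2024-03-15_v02.jpg
```

Because every name comes from the same function, spelling, date format, and version padding stay consistent without you having to remember the rules.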
Speaking of File Formats!
Remember that file formats are not forever. It is best practice to store your data in open (often called open source) file formats, whose specifications are publicly documented. These are most likely to be retrievable in the long term because, unlike proprietary file formats, they are not controlled by a specific software developer. A common open format for storing data from a spreadsheet is CSV, or Comma-Separated Values.
Proprietary file formats are subject to the whims and longevity of their makers, so you are more likely to have trouble opening files stored in them in the future. It may even become impossible! An example of a proprietary file format you might come across often in your research is XLS, the legacy Microsoft Excel format.
Consider whether you need lossless formats or if a "lossy" one is ok. Lossless formats keep all of the original data, even when the file is compressed. This is great if you want to have full control and flexibility with your files in the future. However, they can take up a ton of disk space. Some examples of lossless formats are RAW and TIFF.
"Lossy" formats discard some of the data in your file, but also take up much less space. Lossy formats might be fine for much of your data storage, especially if there is no reason to think you will need to manipulate it in the future. For example, if you are photographing letters in an archive for transcription, then you might not need to worry about maintaining huge lossless files. Readable is good enough. A "lossy" file format you might choose for storing photos is JPEG or JPG.
So now you have collected all this data, and it is looking clean as a whistle. Pat yourself on the back! The hard part is over, but you are not done yet!
You would hate to put all that work into your pristine data just to lose it. You need to make a plan for how you are going to store and back up that data.
Let me suggest the 1, 2, 3 plan.
Level 1: The local home of your data. This is likely to be the computer you are using to work with your data. Level 1 is also where you do all your housekeeping now that you are a master at Data Hygiene.
Level 2: This is your first level of backup. Common choices for Level 2 of your preservation plan are external hard drives or solid state drives. Just remember, if you are traveling with this drive often, it is susceptible to damage or loss. If you can, try to have a backup drive that stays put and be explicit about the schedule you plan to use for backing up. If you can’t automate it, set a reminder in your calendar! Time flies when you’re having fun, and before you know it, you’ve gone a month without backing up your data.
Level 3: This is ideally off-site. By storing a copy of your data off-site, you nearly eliminate the risk of disaster or theft wiping out all your copies at once. A common way to back up your data off-site is by storing it in the cloud. An added benefit of backing your data up to the cloud is that many platforms allow you to run backups automatically. This is a great point to check with your institution to see what cloud storage options they offer. Many colleges and universities provide some amount of cloud storage for free.
A note about the cloud! Many teams use cloud-based platforms as their primary workspace. This is great for facilitating collaboration, but don’t forget to back it up! If you are manipulating your data in the cloud, then the cloud is your Level 1. Set a schedule for downloading your data to a local hard drive at the very least. There are too many ways to lose access to your cloud services to leave it up to chance.
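If your backup tool cannot automate the Level 1 to Level 2 copy, a short script can do it. Here is a rough sketch in Python using only the standard library. The function name back_up and the dated-folder layout are assumptions for illustration, and it makes a plain full copy each time rather than an incremental backup.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def back_up(source: Path, backup_root: Path, when: Optional[date] = None) -> Path:
    """Copy the whole source folder into a dated folder under backup_root."""
    when = when or date.today()
    destination = backup_root / f"{source.name}_{when.isoformat()}"
    # copytree creates the destination (and any missing parent folders).
    # It raises FileExistsError if a copy for this date already exists,
    # which prevents silently overwriting an earlier backup.
    shutil.copytree(source, destination)
    return destination


# Example call with hypothetical paths; replace them with your own:
# back_up(Path.home() / "research-data", Path("/Volumes/BackupDrive/backups"))
```

Each run leaves a separate dated copy, so you can go back to an earlier state if a file was damaged before the latest backup. Dated full copies use a lot of space, so you may want to delete old ones on a schedule.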
What’s that about losing access?
That’s right. Be aware of how and when you might lose access to your cloud data. Many universities only offer cloud storage while you are enrolled with or working for them. This includes your email! Don't forget to download your important email data so that you have access to notes and correspondence in perpetuity.
Regular audits will also keep you from suddenly losing access. Refresh your files, and run checksums on them to confirm that they have not changed at the bit level.
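A checksum audit can be sketched in a few lines of Python. This example assumes SHA-256 as the checksum and a simple "relative path to checksum" manifest; both are choices made here for illustration, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read 1 MiB at a time so large files do not fill up memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict:
    """Map each file's path (relative to folder) to its checksum."""
    return {
        path.relative_to(folder).as_posix(): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(old: dict, new: dict) -> list:
    """List files from the old manifest that are now missing or different."""
    return sorted(name for name, checksum in old.items()
                  if new.get(name) != checksum)
```

In practice you would save the manifest when you first store the data, rebuild it at each audit, and pass both to changed_files. An empty list means every file recorded earlier is still present and unchanged at the bit level.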
Lastly, remember those abbreviations mentioned above? It can be helpful to use abbreviations or shorthand in your data organization for many reasons. If you do, it is important to maintain a Codebook or README file that explains the details of your data. Your future self will thank you. And if your data is made open access after your project is completed, future researchers will be able to interpret your data and build upon your work.
As mentioned in the previous section, Structuring Your Data, there are some great open source data management programs available that can help you organize and annotate your data. Two that are well suited for humanities work are Tropy and OpenRefine.
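A codebook does not need special software. As one possible approach, the Python sketch below stores abbreviations in a CSV file, so the codebook itself is in an open format. The column names and the sample abbreviations are invented for illustration.

```python
import csv
from pathlib import Path

# Hypothetical entries; replace them with the abbreviations you actually use.
CODEBOOK = [
    {"abbreviation": "corr", "meaning": "correspondence",
     "notes": "letters sent or received"},
    {"abbreviation": "ms", "meaning": "manuscript",
     "notes": "handwritten material"},
]

FIELDS = ["abbreviation", "meaning", "notes"]


def write_codebook(rows: list, path: Path) -> None:
    """Save the codebook as a CSV file with a header row."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_codebook(path: Path) -> dict:
    """Load the codebook as an abbreviation-to-meaning lookup."""
    with path.open(newline="", encoding="utf-8") as handle:
        return {row["abbreviation"]: row["meaning"]
                for row in csv.DictReader(handle)}
```

Keeping the codebook file in the same folder as the data it describes means the explanation travels with every backup copy.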
Let's Review!
The safest choices for file formats are open source formats because they are not tied to the functionality of a specific software program. By choosing open source formats you are protecting your data from being locked in an obsolete format.
1. Your local or working storage location
2. An external backup copy of your data
3. A second backup copy of your data, preferably off-site
A codebook or README file which annotates your data and can provide an explanation of any abbreviations, standards, systems, or other data management choices you made when organizing your data.