Hank Rappaport, 28 April 2015
Part 1 – Definitions
I gave a talk recently designed to help non-techies understand what a database is, what it can do for them, and simple techniques for getting started doing database style manipulations of their information. In this and subsequent blog posts I will share those thoughts with a larger audience.
This blog assumes you are familiar with the basics of using Microsoft Excel. You are comfortable entering data into a spreadsheet and doing simple operations like getting column totals, doing sorting and filtering.
So what is a database, and how does it differ from what you can do in Excel?
I like to think of databases as once you get stuck in Excel or you have a challenge doing something in Excel, databases are probably what you need to jump to next.
So what could a database be?
How about “a place to put information”
Well, Excel is a place where you can put information too. You enter data in a worksheet, save your workbook, and you have “put your information in a safe, retrievable and sharable place”. But a database has more structure than Excel. You have to work harder to get data into a database, but there are benefits down the line. More on that later. Databases enforce a lot more rigidity as to how the data is represented, stored and managed.
How about size?
Could databases just hold more information that what Excel can handle? Excel 97 used to be limited to about 16,000 rows, and Excel 2003 allowed at most 65,000 rows. Starting with Excel 2007, your worksheets can hold a million rows per sheet, with 16,000 columns in the sheet and up to 32,000 characters per cell. That’s a lot of data. If you need more than that, you’ve probably already discovered databases.
How about “relationships”?
Yes, relationships are very important in the database world, and are hard to manage with basic Excel operations. Relationships allow you to specify how one kind of data is connected to a different kind of data. For example, take restaurants and customers. You could create an Excel worksheet holding a list of every time someone in your town went to a restaurant. Each row would be a visit by someone to a restaurant. In each row, you would have to list the person’s name and details (age, gender, zip code, maybe the brand of car they drive), the restaurant name and details (class of food, zip code, price range), and details about the visit (date, meal type (breakfast, lunch, dinner), how much was spent, how the person paid). There would be a lot of information duplicated across the rows. If a person visited 5 restaurants, the person’s name, age, zip code, etc., would have to show in each of the rows. Similarly, if a restaurant was visited by 5 customers, the restaurant information would be duplicated across the rows.
Duplicating information is very wasteful and prone to error (what happens if you realize you had the person’s zip code wrong and needed to update it? You’d have to apply the update to each of the rows that listed the person). Databases simplify this process, allowing you to create a list of restaurants, a list of people, and a “relationship” list that documents every time a person visited a restaurant. This “relationship” list just has the details of the visit. It does not hold information about each restaurant or customer, other than an identifier of restaurant and customer. Only when needed, as part of getting answers from the database, such as how many lunch meals were served to male residents of zip 22222 by restaurants in zip code 22224 serving Chinese food, are the various lists joined together. No information needs to be duplicated in storage prior to calculating the answer.
How about sharing and security, or changing data?
Yes, databases help with this. You can put lots of information in a spreadsheet, and give the spreadsheet to lots of colleagues. But what if you realized there was an error in the spreadsheet, or new information that that needs to be added? How would you get it back from them to ensure they were using correct information? A database centralizes where the information is stored, and from where it is accessed, so everyone gets the right information on demand, regardless of changes to the data.
Stay tuned for Part 2 – where we discuss relationships in more detail.