Yesterday I posted a request for a method to perform a probabilistic lookup. Lots of good ideas. I chose the method suggested by Mpemba, mainly because he demonstrated that it works.
My goal was to create a workbook that generates a list of random fake names. I got it down to a single formula that uses two parameters: The probability of a male name, and the probability of having a middle initial. I think there’s about 2.7 billion unique names than can be generated. This sort of thing has lots of uses: generating data for samples files, and… uh… Well, I’m sure it’s good for something else. Anyone in the identity theft industry?
Here are some random names that it spit out: Delores A. Spear, Jason Dyer, David I. Hearn, Paulette Crissman, Gerald Olson, Kathleen Oleary, Joan Smith, Curtis Watt, Latina Murguia, Boyd Therrien, Rita Z. Morales, David Rapp, Ma Anderson, Kurt Cunningham, Nancy Hackney, Stacey I. Harris, Heath Hunter, Trevor Sisson, Carol Papenfuss, Patrick Lenhardt, Ian Grayson, Earl T. Wiggins, John Brooks, Billy Barkley, Antionette Adkisson, Edward I. Gibson, Lisa Stillman, Amanda Arthur, Lillian L. Archuleta, Virginia Ryan, Barbara Fritts, Robert Jones, Jacqueline Lowe, Donald Steiner, and Joyce Brown.
These names certainly sound real to me. In fact, I think I went to elementary school with Barbara Fritts.
In any case, feel free to download it and try it out: namegenerator.zip. The ZIP file is 790K, and the XLS file is about 3.3Mb.
Hey! I thought you were supposed to be writing about pivot tables?
There’s a lady in my church named Joyce Brown :)
Nice work.
Die_Another_Day
Don’t you mean xlsx file? I find it curious that out of 20 names and 7 middle initials, that three would be “I”. I wonder what kind of ethnic incompatabilities you’d get, like Conchita Stolinski.
” Well, I’m sure it’s good for something else”
Here’s a use…
See how many F9’s it takes to get a name like this
Ima Toole
Roleplayers would love it.
Let me edit that idea.
The world’s first Excel drinking game.
See how many F9’s it takes to get a name like this: Ima Toole
You could market it to colleges.
I was writing about pivot tables. I was trying to find a good set of data to use as an example file. I could have used names from my PUP sales database, but that would probably be a privacy violation. So, one thing led to another thing, and this turned out to be a lot more interesting. I actually put off doing the two pivot table chapters, and ended up doing four easy ones instead.
Regarding the middle initials. The letter I is pretty low on the list, so getting 3 I’s out of 7 is very unlikely. Somebody can probably calculate the exact probability.
Now I know who to thank when my spam comes from names like:
“Delores A. Spear, Jason Dyer, David I. Hearn, Paulette Crissman”
instead of “Yolikak, Meity_ie, Halooing and Ratih Kusumaningram” like it does now.
Thanks this is great. I especially like using the data.
About 10 years ago I wanted to find out the “median name” (that is, half the population has a more popular name and half has a less popular name). I do not remember which women’s name was median but “Keith” has the median male name and “McCracken” was the median last name. In your data “Keith” is still the median name (it’s refreshing to know good old Keith is still in the middle). “McCracken” was nearly the median last name again but had unfortunately drifted a bit into unpopularity. “Spangler” is the median name now.
“Hazel” is the “median” female name (none of the names except maybe “Rhonda” around the median rung a bell as having been the median last time). Anyway, Keith and Hazel Spangler of Anytown, USABest Wishes!
“You may be surprised to receive this letter from me
since you do not know me personally. The purpose of my
introduction is that I am Jason Dyer, the first son
of DAVID I. HEARN, the most popular black farmer
in Zimbabwe who was recently murdered in the land
dispute in my country…”
Doh! I just looked at the date on the source file and it is possible that 10 years this was the same data. I remember Keith and Samuel being the two median names and they are in this data. The “McCracken” thing was probably just my mistake as I wasn’t using Excel!
Talking about Median names
have a look here
http://www.babynamewizard.com/namevoyager/
This is a fantastic chronology of names vs popularity
I’ll bet you try all your family and friends
Hui…
All,
I like the post by Mpemba except this excerpt:
“Occurance” is the relative indvidual frequency of the names in ascending order. It does not matter how these are scaled.
I argue the traditional garbage in, garbage out. From a statistical standpoint it is easy to make your name generator match the distribution of names in the cencus (http://www.census.gov/genealogy/names/names_files.html). Note that they only give you the top 90% of names. So add a column (E) with the following formula: =B2/SUM($B$2:$B$4276) starting in E2. This is the normalized probability density function. Add a column (F) with the following formula: =SUM(E2:$E$4276) starting in F2. This is the normalized cumulative distribution function. Apply Mpemba’s formula to column F, and your name generator is now statistically identical to the census’s representation of the population, minus the very obsure names in the bottom ten percent.
-Jason
I’m confused how the excerpt bears any relavence to the points you make?
Of course, it was wrong, I said that already. Ascending order was unimportant.
Garbage in Garbage Out: always true, always will be. Whether the names are right or not is not an Excel problem though. It does not matter whether the list of names stops at 90%, 20% or 99.9% of the population. It does not matter.
Tesing the original sheet on HUGE numbers of samples gave results consistent with the cumulative “prob” column. That was the scope of the answer.
M
Wait a year or two and the names will be Jose Lopez Imilio Gonzalles. And may have this time accept your parameters for FName, MI, LName wouldn’t allow it. :rolleyes:
Getting a real kick out of this spreadsheet. I extended the list to 100 names and was amazed at the percentage of names I recognized as someone I know or knew (about 1:16). What was really surprising was to find my sister’s married name not her first name, but by her nickname and married name (‘Dimple’ Anderson). Someone really tags their kid with a given name of ‘Dimple’? They gotta be from Eugene, Oregon! :-D
I’m still waiting for a reply from Jason.
I’ve used his two formulae and they give EXACTLY (meaning to the precision obtainable from a double) values in Column F as in Mpemba’s column C. So I’m still completely unclear as to what statistical point was being made..
The biggest “flaw” by far in applying the lists of First, Second and Third names to produce combinations seemingly representative of the population is that interaction has been ignored. Some first names just don’t go with other Last names.
1) Ethic origin – Manuel is more likely with a Latin family name for instance.
2) Common sense – Certain combinations just don’t work (unless you happen to be Bart Simpson phoning up Mo)
“Some first names just don’t go with other Last names.” – but remarkably many parents doesn’t care about that :-)
This is, I believe, for creating good looking fake data, not a statistical representation, and it works very fine. My localized adaption is up and running. Thanks for idea and solution everybody, I can really use this !
Best wishes Harald
Harald, by “localized,” do you mean you have a list of Norwegian names?
Soemhow, these names didn’t make the list:
http://www.ruf.rice.edu/~pound/sfng.html ;-)
…mrt
Hi John
Yes, it’s at “the national bureau of statistics” (or whatever it translates to)’s webpage
http://www.ssb.no/navn/
Give me a hint if you need this for something and I’ll assist with scandinavian data.
Best wishes Harald
Mpemba,
Sorry I just checked the comments. I was glad to see you wrote back. To answer John’s original problem: a large data set for toy use, the statement “It does not matter how these [occurrences] are scaled.” is true. But my point is that the names are not scaled arbitrarily. Further, we have the scaling already, so why not use it? Using your method, you will undoubtedly get a rank order that is similar to the desired rank order. But the problem is that the proportion of names will be off. (EG although you will get more James than John [rank order], the proportion of James to the whole population that we have data for will be incorrect.) About the missing 10% of data: there is nothing we can do about that, unless we find someone who works at the Census Bureau. The reason I normalized the data was to make sure that the last name in the list does not get that 10% of the cdf.
The motivation for my post was two-fold: 1) It is easier to use real data, when given, than to fabricate it. 2) I hadn’t picked any nits yet that day.
In your July 17th post, I agree whole-heartedly with your latter point. The list of first and last names cannot generate a list of full names truly representative of the population. There is some correlation for each first name to each last name. However, as an exercise you will get many believable names, as other posters have noticed.
Jason
Jason
I’m still confused for the simple reason that – as I said in a reply – my last column is exactly and precisiely the same as yours. It is normalised. I put your formulae alongside mine to check.
The comment about it not mattering a jot is precisely because we both normalise. There’s nowt you can do about the missing “10%” so we just normalise to make the sum of what we have equal 1.00
As I said too: I tested the formulae on hundreds of thousands of names and there proportions in each name bin were exactly consistent with the fractions: it mattered not whether we were dealing with 90% of all names or only 50%
If you thought you were picking nits I’m still convinced you were looking in the wrong place :o)
M
I agree that your column C and my column F are equivalent equations. The source of my comment was your values in column B, which seem to be invented, at least in your post. When you refer to to getting exactly the same result, are you referring to {Mary 15.10%, Nancy 14.90%, Pamela 9.10%} or {MARY 2.629%, PATRICIA 1.073%, LINDA 1.035%}?? The first is as given in column B of your post; the second is from the cencus.
That’s resolved that then: the values were indeed invented. I simply made them up [typed in the first numbers that came to mind] to give something for the formulae to work with. It was an Excel question and for demonstrating the Excel aspects it matters not at all whether I used made up data or (inaccurate) census data :o)
The census misses a lot of “illegals” …
The xls file shrinks to about 1.8MB if you delete all the empty rows starting at 18845.
If you really want a big list of valid people names, I suggest you download the DirectoryMark tool (for benchmarking X.500 and LDAP directories). This contains, as far as I remember, a script and base data for generating millions of unique, live-live names.
Oops, should have provided a link
http://www.mindcraft.com/directorymark/
Just came across this in one of my RSS feeds: http://www.fakenamegenerator.com/index.php
Looks like you can order in bulk, too. I wonder what the chance is that these actually contain real combinations for our friendly spammers?
This is a fabulous tool- I’m using it to test a randomization macro that someone wanted me to develop at work to use to generate a list of 20 names from ~950 employees for purposes of determining who wins door prizes at our Spring get-together! Thanks so much for making this available!
Nancy:
This is *not* the right solution for you. It’s a great way to generate *fake* names that appear real.
You should look at options such as
Random Selection
http://www.tushar-mehta.com/excel/newsgroups/rand_selection/index.html
John,
Is there any way you could e-mail this to me? Looks like a neat tool, but I can’t download it because Websense blocks the download (Message Boards and Clubs). Thanks.
Kevin
thanks, this excel file is awesome
It is very useful file… awesome work….
Tahnks…
This tool is great- thank you!
Works perfectly thanks !!!
Very helpful thank you!!!!!