Random Name Generator

Yesterday I posted a request for a method to perform a probabilistic lookup. Lots of good ideas. I chose the method suggested by Mpemba, mainly because he demonstrated that it works.

My goal was to create a workbook that generates a list of random fake names. I got it down to a single formula that uses two parameters: The probability of a male name, and the probability of having a middle initial. I think there’s about 2.7 billion unique names than can be generated. This sort of thing has lots of uses: generating data for samples files, and… uh… Well, I’m sure it’s good for something else. Anyone in the identity theft industry?

Here are some random names that it spit out: Delores A. Spear, Jason Dyer, David I. Hearn, Paulette Crissman, Gerald Olson, Kathleen Oleary, Joan Smith, Curtis Watt, Latina Murguia, Boyd Therrien, Rita Z. Morales, David Rapp, Ma Anderson, Kurt Cunningham, Nancy Hackney, Stacey I. Harris, Heath Hunter, Trevor Sisson, Carol Papenfuss, Patrick Lenhardt, Ian Grayson, Earl T. Wiggins, John Brooks, Billy Barkley, Antionette Adkisson, Edward I. Gibson, Lisa Stillman, Amanda Arthur, Lillian L. Archuleta, Virginia Ryan, Barbara Fritts, Robert Jones, Jacqueline Lowe, Donald Steiner, and Joyce Brown.

These names certainly sound real to me. In fact, I think I went to elementary school with Barbara Fritts.

In any case, feel free to download it and try it out: namegenerator.zip. The ZIP file is 790K, and the XLS file is about 3.3Mb.

Posted in Uncategorized

37 thoughts on “Random Name Generator

  1. There’s a lady in my church named Joyce Brown :)

    Nice work.

    Die_Another_Day

  2. Don’t you mean xlsx file? I find it curious that out of 20 names and 7 middle initials, that three would be “I”. I wonder what kind of ethnic incompatabilities you’d get, like Conchita Stolinski.

  3. Let me edit that idea.

    The world’s first Excel drinking game.
    See how many F9’s it takes to get a name like this: Ima Toole

    You could market it to colleges.

  4. I was writing about pivot tables. I was trying to find a good set of data to use as an example file. I could have used names from my PUP sales database, but that would probably be a privacy violation. So, one thing led to another thing, and this turned out to be a lot more interesting. I actually put off doing the two pivot table chapters, and ended up doing four easy ones instead.

    Regarding the middle initials. The letter I is pretty low on the list, so getting 3 I’s out of 7 is very unlikely. Somebody can probably calculate the exact probability.

  5. Now I know who to thank when my spam comes from names like:

    “Delores A. Spear, Jason Dyer, David I. Hearn, Paulette Crissman”

    instead of “Yolikak, Meity_ie, Halooing and Ratih Kusumaningram” like it does now.

  6. Thanks this is great. I especially like using the data.

    About 10 years ago I wanted to find out the “median name” (that is, half the population has a more popular name and half has a less popular name). I do not remember which women’s name was median but “Keith” has the median male name and “McCracken” was the median last name. In your data “Keith” is still the median name (it’s refreshing to know good old Keith is still in the middle). “McCracken” was nearly the median last name again but had unfortunately drifted a bit into unpopularity. “Spangler” is the median name now.

    “Hazel” is the “median” female name (none of the names except maybe “Rhonda” around the median rung a bell as having been the median last time). Anyway, Keith and Hazel Spangler of Anytown, USA–Best Wishes!

  7. “You may be surprised to receive this letter from me
    since you do not know me personally. The purpose of my
    introduction is that I am Jason Dyer, the first son
    of DAVID I. HEARN, the most popular black farmer
    in Zimbabwe who was recently murdered in the land
    dispute in my country…”

  8. Doh! I just looked at the date on the source file and it is possible that 10 years this was the same data. I remember Keith and Samuel being the two median names and they are in this data. The “McCracken” thing was probably just my mistake as I wasn’t using Excel!

  9. All,

    I like the post by Mpemba except this excerpt:

    “Occurance” is the relative indvidual frequency of the names in ascending order. It does not matter how these are scaled.

    I argue the traditional garbage in, garbage out. From a statistical standpoint it is easy to make your name generator match the distribution of names in the cencus (http://www.census.gov/genealogy/names/names_files.html). Note that they only give you the top 90% of names. So add a column (E) with the following formula: =B2/SUM($B$2:$B$4276) starting in E2. This is the normalized probability density function. Add a column (F) with the following formula: =SUM(E2:$E$4276) starting in F2. This is the normalized cumulative distribution function. Apply Mpemba’s formula to column F, and your name generator is now statistically identical to the census’s representation of the population, minus the very obsure names in the bottom ten percent.

    -Jason

  10. I’m confused how the excerpt bears any relavence to the points you make?
    Of course, it was wrong, I said that already. Ascending order was unimportant.

    Garbage in Garbage Out: always true, always will be. Whether the names are right or not is not an Excel problem though. It does not matter whether the list of names stops at 90%, 20% or 99.9% of the population. It does not matter.

    Tesing the original sheet on HUGE numbers of samples gave results consistent with the cumulative “prob” column. That was the scope of the answer.

    M

  11. Wait a year or two and the names will be Jose Lopez Imilio Gonzalles. And may have this time accept your parameters for FName, MI, LName wouldn’t allow it. :rolleyes:

  12. Getting a real kick out of this spreadsheet. I extended the list to 100 names and was amazed at the percentage of names I recognized as someone I know or knew (about 1:16). What was really surprising was to find my sister’s married name not her first name, but by her nickname and married name (‘Dimple’ Anderson). Someone really tags their kid with a given name of ‘Dimple’? They gotta be from Eugene, Oregon! :-D

  13. I’m still waiting for a reply from Jason.

    I’ve used his two formulae and they give EXACTLY (meaning to the precision obtainable from a double) values in Column F as in Mpemba’s column C. So I’m still completely unclear as to what statistical point was being made..

    The biggest “flaw” by far in applying the lists of First, Second and Third names to produce combinations seemingly representative of the population is that interaction has been ignored. Some first names just don’t go with other Last names.

    1) Ethic origin – Manuel is more likely with a Latin family name for instance.
    2) Common sense – Certain combinations just don’t work (unless you happen to be Bart Simpson phoning up Mo)

  14. “Some first names just don’t go with other Last names.” – but remarkably many parents doesn’t care about that :-)

    This is, I believe, for creating good looking fake data, not a statistical representation, and it works very fine. My localized adaption is up and running. Thanks for idea and solution everybody, I can really use this !

    Best wishes Harald

  15. Hi John

    Yes, it’s at “the national bureau of statistics” (or whatever it translates to)’s webpage

    http://www.ssb.no/navn/

    Give me a hint if you need this for something and I’ll assist with scandinavian data.

    Best wishes Harald

  16. Mpemba,

    Sorry I just checked the comments. I was glad to see you wrote back. To answer John’s original problem: a large data set for toy use, the statement “It does not matter how these [occurrences] are scaled.” is true. But my point is that the names are not scaled arbitrarily. Further, we have the scaling already, so why not use it? Using your method, you will undoubtedly get a rank order that is similar to the desired rank order. But the problem is that the proportion of names will be off. (EG although you will get more James than John [rank order], the proportion of James to the whole population that we have data for will be incorrect.) About the missing 10% of data: there is nothing we can do about that, unless we find someone who works at the Census Bureau. The reason I normalized the data was to make sure that the last name in the list does not get that 10% of the cdf.

    The motivation for my post was two-fold: 1) It is easier to use real data, when given, than to fabricate it. 2) I hadn’t picked any nits yet that day.

    In your July 17th post, I agree whole-heartedly with your latter point. The list of first and last names cannot generate a list of full names truly representative of the population. There is some correlation for each first name to each last name. However, as an exercise you will get many believable names, as other posters have noticed.

    Jason

  17. Jason

    I’m still confused for the simple reason that – as I said in a reply – my last column is exactly and precisiely the same as yours. It is normalised. I put your formulae alongside mine to check.

    The comment about it not mattering a jot is precisely because we both normalise. There’s nowt you can do about the missing “10%” so we just normalise to make the sum of what we have equal 1.00

    As I said too: I tested the formulae on hundreds of thousands of names and there proportions in each name bin were exactly consistent with the fractions: it mattered not whether we were dealing with 90% of all names or only 50%

    If you thought you were picking nits I’m still convinced you were looking in the wrong place :o)

    M

  18. I agree that your column C and my column F are equivalent equations. The source of my comment was your values in column B, which seem to be invented, at least in your post. When you refer to to getting exactly the same result, are you referring to {Mary 15.10%, Nancy 14.90%, Pamela 9.10%} or {MARY 2.629%, PATRICIA 1.073%, LINDA 1.035%}?? The first is as given in column B of your post; the second is from the cencus.

  19. That’s resolved that then: the values were indeed invented. I simply made them up [typed in the first numbers that came to mind] to give something for the formulae to work with. It was an Excel question and for demonstrating the Excel aspects it matters not at all whether I used made up data or (inaccurate) census data :o)

    The census misses a lot of “illegals” …

  20. The xls file shrinks to about 1.8MB if you delete all the empty rows starting at 18845.

  21. If you really want a big list of valid people names, I suggest you download the DirectoryMark tool (for benchmarking X.500 and LDAP directories). This contains, as far as I remember, a script and base data for generating millions of unique, live-live names.

  22. This is a fabulous tool- I’m using it to test a randomization macro that someone wanted me to develop at work to use to generate a list of 20 names from ~950 employees for purposes of determining who wins door prizes at our Spring get-together! Thanks so much for making this available!

  23. John,

    Is there any way you could e-mail this to me? Looks like a neat tool, but I can’t download it because Websense blocks the download (Message Boards and Clubs). Thanks.

    Kevin


Posting code? Use <pre> tags for VBA and <code> tags for inline.

Leave a Reply

Your email address will not be published.