Probabilistic Lookup

Posted on July 12, 2006 by John Walkenbach

Here’s a challenge for someone. Well, maybe it’s easy, but it’s challenging for me. For lack of a better name, I call it probabilistic lookup. Say you have a list of names, ordered by popularity (according to the U.S. Census Bureau):

Mary
Nancy
Pamela
Beatrice
Deborah
Enid
Lolita
Delta
Mayola
Concha

How can you randomly select a name such that names at the top of the list are more likely to be selected than names at the bottom of the list? In other words, if you randomly select 1,000 names from this list and then do a sorted frequency distribution of the selected names, they should (roughly) appear in the same order as the original list.

It needs to be generalizable to handle a much larger list (~10,000) . I’m not seeking statistical perfection, just a reasonably accurate way to give names higher on the list a greater chance of being selected, and names lower on the list a lower chance of being selected.

I’d prefer a formula solution, if possible.

20 thoughts on “Probabilistic Lookup”

Jack Dahlgren says:

July 12, 2006 at 6:34 pm

I can think of a number of ways.
Here is a pretty simple one.
Reverse the list order.
Add a column
Autonumber 1 through n+1 (assuming n names)
In the next column use a formula
The first cell is 1, the next one down = cell above + the autonumbered cell
Drag all the way down to one beyond the last name.
This is now your lookup table.

Next use random() and multiply by the last value
Use that as the input to your lookup table and return the name associated with that number.

If you had 4 names the first name would have 1 chance in 10, the second 2 chances in ten etc…

1 Bob
3 Carol
6 Ted
10 Alice
James Low says:

July 12, 2006 at 8:31 pm

Hi, not sure if this is the best way but:

http://jameslow.com/ProbabilisticLookup.xls
Shawn Stackhouse says:

July 12, 2006 at 8:40 pm

Hi John,

How about this solution:

Assign your name list a named range of Names.

This formula will return the names at the top of the list more often:

=INDEX(Names,INT(COUNTA(Names)-(SQRT(2*(RAND()*COUNTA(Names)*(COUNTA(Names)+1)/2)+0.25)-0.5))+1)

The weights work like this:

e.g. If there are 3 names in the list
Name 1 is returned around 3/6 of the time
Name 2 is returned around 2/6 of the time
Name 3 is returned around 1/6 of the time

e.g. If there are 4 names in the list
Name 1 is returned around 4/10 of the time
Name 2 is returned around 3/10 of the time
Name 3 is returned around 2/10 of the time
Name 4 is returned around 1/10 of the time

etc.

If you’re interested in the math, then I could explain it!

Thanks,
Shawn
Bill Simoni says:

July 12, 2006 at 9:48 pm

John – I’ve taken a stab at it. My solution requires some sort of frequency to be specified. Lacking one, I assumed the top item in a list of n items happens n times, the second item happens (n-1) times, and so forth. I suppose it’s possible some sort of megaformula could be made to handle it all in one calc, but I’d prefer not to try and wrap my brain around that…

I’ve posted the file at my site so anybody can download and play with it: http://bill.simonifamily.net/2006/07/12/probabilistic-lookup-in-excel/
Jason Morin says:

July 12, 2006 at 10:06 pm

Here’s a start. Assuming the names are in A1:A10, I assigned probabilities to each name in B1:B10 and then sorted both columns ascending based on column B. So I have:

Delta 0.005
Mayola 0.005
Concha 0.005
Lolita 0.01
Enid 0.025
Beatrice 0.05
Deborah 0.05
Pamela 0.1
Nancy 0.25
Mary 0.5

I used a helper column (column C) by multiplying each probability by 10000 (cell G1) and then adding the values cumulatively down col. C. So C1 has: =B1*$G$1 and C2: =B2*$G$1+C1. Copy the formula in C2 down to C10. Now I have 3 columns that look like:

Delta 0.00550
Mayola 0.005100
Concha 0.005150
Lolita 0.01250
Enid 0.025500
Beatrice 0.051000
Deborah 0.051500
Pamela 0.12500
Nancy 0.255000
Mary 0.510000

From here I used an INDEX/MATCH combo with RANDBETWEEN selecting a whole number between 0 and 9999:

=INDEX(A1:A10,MATCH(RANDBETWEEN(0,G1-1),C1:C10,1)+1)

Of course the biggest issue is that the numbers 0-49 will product #N/A, and the formula will never return “Delta.” That should be an easy workaround but I can’t find one at the moment. I’ll address later.

I also picked “pretty” probabilities in this example.
Sam says:

July 12, 2006 at 11:24 pm

Hi

Formula in Colum C =IF(RANDBETWEEN(1,26)>ROW()=TRUE,A2,0)
Formula in Colum D =IF(C2=0,””,IF(AND(C20),”T”,”B”))
This gives more random “T”‘s than “B”‘s

Regards
Sam

A B C D
SlNameTOP/BOTTOM
1a1TTop1077%
2b0Bottom323%
3c3T13
4d4T
5e5T
6f6T
7g0
8h8T
9i9T
10j0
11k11T
12l12T
13m13T
14n14B
15o0
16p0
17q17B
18r0
19s0
20t0
21u21B
22v0
23w0
24×0
25y0
26z0
XLurker says:

July 13, 2006 at 12:25 am

Hi John,

If the list is in, say, cells A1 to A10, then you could make your selections with a formula like:

=OFFSET($A$1,TRUNC(RAND()*RAND()*COUNTA($A$1:$A$10),0),0)

Cheers,

XLurker
Mike Alexander says:

July 13, 2006 at 12:30 am

Six Steps:

1. Copy list to A1:A10
2. Sort list by smallest frequency of occurance to largest
3. Enter 0 in Cell B1
4. Enter =INT(AVERAGE(B1+2,(B1+2)*2)) in cell B2
5. Copy formula down to Cell B11 (Yes…one row below the end the of list)
6. Enter this formula anywhere =INDEX($A$1:$A$10,(MATCH(INT((RAND()*MAX($B:$B)+1)),$B$1:$B$11,1)))

Copy formula down to 1000 rows, and you will get a distribution that should support the premise of your original list. The more rows you include, the closer you get to the correct distribution.
The formula in step 4 could be tweaked for better results.
Jasser says:

July 13, 2006 at 12:33 am

Hi,

here’s my try:

=INDEX(INDIRECT(“A1:A”&COUNTA($A:$A)),COUNTA($A:$A)-INT(LOG10(10*(9/10*RAND()+0.1))*COUNTA($A:$A)))

Assuming that your list of names starts at A1.

-Asser

First is to generate an index table:
“Occurance” is the relative indvidual frequency of the names in ascending order. It does not matter how these are scaled.
“Prob” is the cumulative probability of that name being returned. The top name will always be 1.000

The formula in the Prob column is:
= SUM(B2:$B$11)/SUM($B$2:$B$11)

	A	B	C
1	Names	Occurance	Prob
2	Mary	15.10%	1.000
3	Nancy	14.90%	0.765
4	Pamela	9.10%	0.533
5	Beatrice	6.10%	0.392
6	Deborah	5.60%	0.297
7	Enid	4.80%	0.210
8	Lolita	3.10%	0.135
9	Delta	2.40%	0.087
10	Mayola	1.90%	0.050
11	Concha	1.30%	0.020

To actually return names the formula is:
=INDEX(Names,MATCH(RAND(),Prob,-1))

Where:
“Names” represents A$2:$A$11
“Prob” represents C$2:$C$11

Example Result for a block of 10000 selections
FWIW formula to complie this was:
=COUNTIF($J$1:$S$1000,A2) – to Give answer for “Mary”

Mary 2337
Nancy 2316
Pamela 1325
Beatrice 960
Deborah 922
Enid 783
Lolita 520
Delta 348
Mayola 308
Concha 181

Mpemba

Editor: Mpemba, I’ve been using your comment to play around with the formatting.

Mpemba says:

July 13, 2006 at 4:04 am

Darn that formatting: it looked fine in the preview window (Courier) !

I should add, if you want a working copy of the spreadsheet I’d gladly mail it.

M
XLurker says:

July 13, 2006 at 4:47 am

Hi,

If the list is in, say, cells A1 to A10, then how about making your selections with a formula like this:

=OFFSET($A$1,TRUNC(RAND()^2*COUNTA($A$1:$A$10),0),0)

The COUNTA is for scalability of the list. Set the power of 2 higher to skew the selection more to the top of the list, or between 0 and 1 to skew the selection to the bottom of the list.

Cheers,

XLurker
Mpemba says:

July 13, 2006 at 5:32 am

Me again:
In hindsight I should add that in post 4 my “Occurance” column does not have to be sorted in to ascending order. The order does not matter a jot using this method.

I must also admit I found it surprising Mary was at the top of the example list. The 2005 data for the USA listed at:
http://www.socialsecurity.gov/OACT/babynames/

Top ten for the 2000s
http://www.socialsecurity.gov/OACT/babynames/decades/names2000s.html
Emily 149,420 1.2525%
Madison 123,729 1.0372%
Hannah 110,081 0.9228%
Emma 106,428 0.8921%
Ashley 91,644 0.7682%
Abigail 89,848 0.7532%
Alexis 89,512 0.7503%
Olivia 88,971 0.7458%
Samantha 88,669 0.7433%
Sarah 85,747 0.7188%
Gareth says:

July 13, 2006 at 6:16 am

RNDBTWN? thght mst b mssng smthng, hng n tls > dd-ns > nlyss tlpk.

It’s making a lot more sense now.
J Walker says:

July 13, 2006 at 6:45 am

Mpemba:
You shouldn’t be surprised Mary was at the top of the list. If you go to the site you linked and check the “Popularity of a Name” on Female for the past 100 years using the name Mary, you’ll find it was a very popular name for a very long time.
John Walkenbach says:

July 13, 2006 at 7:48 am

Thanks to everyone. I haven’t gone through them all yet, but this is a great help.

BTW, here’s where I got the names from:

http://www.census.gov/genealogy/names/names_files.html

And each name has a frequency.
Shawn Stackhouse says:

July 13, 2006 at 8:19 am

For a graphical representation of name popularity over the decades, check out the totally fun Baby Name Wizard at http://www.babynamewizard.com/namevoyager/

Poor Mary ain’t what she used to be!

From these charts, you can easily pinpoint when I was born. :)
Andrew Reynolds says:

July 13, 2006 at 11:49 am

{=OFFSET($A$1,MATCH(RAND(),1-EXPONDIST(ROW(1:[n])-1,[tweakable constant],TRUE),-1),0)} … so entered as an array formula over the required output cells.

A1 is one cell above the input list
[n] is the upper bound for the number of input values + 1
[tweakable constant] controls the decay in probabilities for the exponential function. 0.1 ‘works’ over 10,000 cells, 0.9 can be seen to ‘work’ for a shorter list. 0.1 is very ‘slow’ decay in probability.
Cyrus Sujar... says:

July 15, 2008 at 10:44 am

Statistical Perfection yields the best results. Sorry so late but I was helping someone with homework and came across this unsatisfactory answer. What you need is an “adjusted cumulative frequency” column to the left of the names. Luckily, the data includes the cumulative frequency:

adj.cf name freq cum.freq(%) rank
0 MARY 2.629 2.629 1
.02629 PATRICIA 1.073 3.702 2
.03702 LINDA 1.035 4.736 3
.04736 BARBARA 0.980 5.716 4
.05716 ELIZABETH 0.937 6.653 5

=VLOOKUP(RAND(), $A$2:$B$6, 2)
or if you use the actual data
=VLOOKUP(RAND(), $A$2:$B$4275, 2)

http://www.excelforum.com/showthread.php?t=628930
shows similar question with Letters of alphabet
Bill G. says:

December 13, 2010 at 12:04 pm

Thanks to Dick and all the contributors of this site. I have picked up hundreds of tips here. Many thanks.

I am trying to develop a function that will count the number of runs within a range with one run being defined as when the value encounters a sign change (i.e. from positive to negative or negative to positive). Consider the following range: starting in cell A1 and continuing down to cell A10 I have these numbers: 125, 200, 175, -50, -75, 135, -65, -225, -90, and 115. The first three values (125, 200, 175) are positive and are considered as one run (a winning streak). The next two values are negative (-50, -75) thus are considered another run (a losing streak). The next value (135) is positive and is considered another run. The next three values (-65, -225, -90) are negative and thus is considered another run. Finally, the last value in the range (115) is positive, thus is considered another run. Therefore, for this example range the total number of runs 5. I have developed functions to calculate the longest winning and losing streaks, but I am really hitting a wall developing a function to calculate the total number of runs within a range. Any suggestions on how to approach this problem would be greatly appreciated. Thanks.