Author Archive

PowerPivot – Part 4 of 4: A geographic visual display of 18 million US Census records

Part 1: http://www.dailydoseofexcel.com/archives/2010/02/26/powerpivot-part-1-of-4/

Part 2: http://www.dailydoseofexcel.com/archives/2010/02/27/powerpivot-part-2-of-4-prepping-the-census-data/

Part 3: http://www.dailydoseofexcel.com/archives/2010/03/02/powerpivot-part-3-of-4-conditional-shape-colors/

In this tip I discuss integrating the result of a PowerPivot analysis of a large data set (18 million records) into a geographic map using a method I call “Conditional Color of Shapes.”

image001

For a version in a page by itself (i.e., not in a scrollable iframe as below) visit http://www.tushar-mehta.com/publish_train/xl_vba_cases/powerpivot_conditional_shape_color.shtml

Using the MATCH Function for a Linear Search and a Binary Search

Given the problem I have with images and blog posts, I’m trying something new. I uploaded the content of the post to my website and am using an iframe to post it here. Yes, the UI experience is somewhat different than scrolling the entire browser window but hopefully it is something most people will be OK with. And, yes, I can adjust both the height and the width of the iframe.

This approach has another benefit in that one can read the entire post on the home page itself.

PowerPivot - Part 3 of 4: Conditional Shape Colors

Part 1: http://www.dailydoseofexcel.com/archives/2010/02/26/powerpivot-part-1-of-4/

Part 2: http://www.dailydoseofexcel.com/archives/2010/02/27/powerpivot-part-2-of-4-prepping-the-census-data/

As mentioned in the Part 1, I had developed this technique to conditionally color shapes some time ago. Based on customer feedback and my own experience I made changes to simplify the various connections required to make it all work.

Also, documenting the technique was easier to do with a webpage for two reasons: 1) the length of the post, and 2) I find the UI for a blog post somewhat cumbersome for managing images.

The updated solution is at http://www.tushar-mehta.com/publish_train/xl_vba_cases/0301-dashboard-conditional_shape_colors.htm

In Part 4, the last part of this multi-part post, I will look at using PowerPivot from within Excel to analyze the 18 million row data set generated in Part 2 and on integrating the result with the Conditional Shape Color solution of this post.

picture11

PowerPivot - Part 2 of 4: Prepping the census data

Part 1: http://www.dailydoseofexcel.com/archives/2010/02/26/powerpivot-part-1-of-4/

Since Excel 2007 (and later) can handle a million rows of data, I figured it made sense to explore PowerPivot (PP) with a data set larger than that. In addition, I have had a long time curiosity about the US census data. So, this seemed an appropriate time to combine the two. After some exploration of www.census.gov, I learned of http://usa.ipums.org/usa/index.shtml This is a project of the Minnesota Population Center of the University of Minnesota.

Do note that this post is not meant to be a step-by-step tutorial on how to use the IPUMS website. While it took some time, it took even more effort to understand the census data but an explanation of the data is also not my intent with this post. The focus is on loading a large data set in PowerPivot.

The final data set I downloaded expanded to nearly 18 million rows of data. But, it took a while before I got to the point of getting that data set prepped for PP.

A relatively superficial exploration of the IPUMS site led to two conclusions. One, the result of ‘analyze data online’ would yield results online but the result sets were intentionally limited in size. Two, trying to get detailed data through the ‘Create an Extract’ or the ‘Download or Revise Extracts’ led to large data sets but of the kind that were intended for use with SPSS, SAS, or STATA. The data set itself consisted of fixed length columns. In addition, there was a delimiter at the end of each row record (discovered through trial and error).

In addition to the data set, the website provided the command sets to load the data into SPSS, SAS, and STATA.

picture1

The SPSS instructions were fairly helpful in understanding the layout and the content of the data (other than the secret delimiter).

So, for one data set the columns were

 year           1-4
 datanum        5-6
 serial         7-14
 numprec       15-16
 hhwt          17-20
 statefip      21-22
 county        23-26
 urban         27
 city          28-31
 citypop       32-36
 urbpop        37-41
 ownershpd     42-43
 mortotal      44-47
 mortamt1      48-52
 rentgrs       53-56
 hhincome      57-63
 pernum        64-67
 perwt         68-71
 age           72-74
 sex           75
 marst         76
 raced         77-79
 gradeattd     80-81
 occ1950       82-84
 ind1950       85-87
 inctot        88-94
 ftotinc       95-101

In addition, the SPSS instructions also contained information about the text values associated with the numeric codes in the data set. For example, ownershpd values were:

value labels ownershpd
00 "N/A"
 10 "Owned or being bought"
 11 "Check mark (owns?)"
 12 "Owned free and clear"
 13 "Owned with mortgage or loan"
 20 "Rented"
 21 "No cash rent"
 22 "With cash rent"

The above information is important since we can create Excel tables of these number-to-text maps and link them to the data table in PP.

What made the process of getting the data in PP challenging was that PP will not read fixed column width data from a text file though it will read comma delimited text files. I would have thought this would be elementary but I couldn’t find any way to get PP to do so. Maybe, I missed something that would have made the rest of the exercise below unnecessary.

So, as far as I could tell, I had to convert the fixed-column-width text file into a file where the columns were delimited by some character. I picked the comma. But, I couldn’t use Excel (too many records) or Word (too large a file) to do the conversion. I also tried Notepad and WordPad before falling back on true and tried VB(A).

I did not start with the 18 million record data set. I used the smaller data set for a single census from the 19th century to explore the structure of the data. Through trial and error I discovered the secret delimiter at the end of each record. I also discovered that the software did not summarize the data in any way. The code that I felt confident I could use for data from all the censuses combined is below.

It’s not elegant. It’s not flexible. It’s not meant to be either. At some point I might be tempted to paste the information about the column names and widths into a Excel worksheet and have the code automagically work off that. But, for now, the goal was to prep the data for PP.

One interesting thing I did learn was that using FreeFile() twice in succession yields the same integer value. That, of course, leads to an error when opening the 2nd file. So, the correct approach is to get a free file number, open that file, and then get a second free file number for another file.

Also, if one were to add up the column widths of the various fields one would get 101 but the record being read has 102 characters. The last byte is the delimiter that I want to drop.

Option Explicit

Sub writeOneRec(OutF As Long, Rslt As String)
    Dim Yr As Integer, DataNum As Integer, Serial As Long, _
        NumPrec As Integer, HHwt As Integer, _
        StateFIP As Integer, County As Integer, Urban As Integer, _
        City As Integer, CityPop As Long, _
        UrbPop As Integer, _
        OwnerShip As Integer, MorgTotal As Integer, _
        MorgAmt1 As Integer, RentGross As Integer, _
        HHIncome As Long, _
        PerNum As Long, PerWt As Integer, Age As Integer, _
        Sex As Integer, MarSt As Integer, Race As Integer, _
        GradeAtt As Integer, Occ1950 As Integer, Ind1950 As Integer, _
        IncTot As Long, FamTotInc As Long
    On Error Resume Next
    Yr = Left(Rslt, 4)
    DataNum = Mid(Rslt, 5, 2)
    Serial = Mid(Rslt, 7, 8)
    NumPrec = Mid(Rslt, 15, 2)
    HHwt = Mid(Rslt, 17, 4)
    StateFIP = Mid(Rslt, 21, 2)
    County = Mid(Rslt, 23, 4)
    Urban = Mid(Rslt, 27, 1)
    City = Mid(Rslt, 28, 4)
    CityPop = Mid(Rslt, 32, 5)
    UrbPop = Mid(Rslt, 37, 5)
    OwnerShip = Mid(Rslt, 42, 2)
    MorgTotal = Mid(Rslt, 44, 4)
    MorgAmt1 = Mid(Rslt, 48, 5)
    RentGross = Mid(Rslt, 53, 4)
    HHIncome = Mid(Rslt, 57, 7)
    PerNum = Mid(Rslt, 64, 4)
    PerWt = Mid(Rslt, 68, 4)
    Age = Mid(Rslt, 72, 3)
    Sex = Mid(Rslt, 75, 1)
    MarSt = Mid(Rslt, 76, 1)
    Race = Mid(Rslt, 77, 3)
    GradeAtt = Mid(Rslt, 80, 2)
    Occ1950 = Mid(Rslt, 82, 3)
    Ind1950 = Mid(Rslt, 85, 3)
    IncTot = Mid(Rslt, 88, 7)
    FamTotInc = Mid(Rslt, 95, 7)
   
    On Error GoTo 0
    Write #OutF, Yr, DataNum, Serial, NumPrec, HHwt, StateFIP, _
        County, Urban, City, CityPop, UrbPop, OwnerShip, _
        MorgTotal, MorgAmt1, RentGross, HHIncome, _
        PerNum, PerWt, Age, Sex, MarSt, Race, _
        GradeAtt, Occ1950, Ind1950, IncTot, FamTotInc

    End Sub
Sub openDB9()
    Dim X, Rslt As String, OutF As Long
    X = FreeFile()
    Open "c:\tushar\work\usa_00009.dat\usa_00009.dat" _
        For Binary As #X Len = 102
    OutF = FreeFile()
    Open "c:\tushar\work\usa_00009.txt" _
        For Output As #OutF
       
    Write #OutF, "Yr", "DataNum", "Serial", _
        "NumPrec", "HHwt", _
        "StateFIP", "County", "Urban", _
        "City", "CityPop", _
        "UrbPop", _
        "OwnerShip", "MorgTotal", _
        "MorgAmt1", "RentGross", _
        "HHIncome", _
        "PerNum", "PerWt", "Age", _
        "Sex", "MarSt", "Race", _
        "GradeAtt", "Occ1950", "Ind1950", _
        "IncTot", "FamTotInc"
    Rslt = String(102, " ")
    Get #X, , Rslt
    Do While Not EOF(X)
        writeOneRec OutF, Rslt
        Get #X, , Rslt
        Loop
    'Stop
   Close #X: Close #OutF
    End Sub

And, so now, I have a CSV data set ready to load into PowerPivot.

PowerPivot - Part 1 of 4

After the MVP Summit I was motivated to further explore a Microsoft add-in for Excel 2010 called PowerPivot. I don’t know all its capabilities but at the very least it can analyze millions of rows of data. If you are using Excel 2010 you can download the add-in from http://www.microsoft.com/downloads/details.aspx?FamilyID=48a5b47b-8c9c-450f-ab6e-178600a733ca&displaylang=en

Do keep one thing in mind. The Community Technical Preview (CTP) version of the PowerPivot addin that is currently available is incompatible with the currently available version of Office 2010, i.e., the Release Candidate (RC). The problem seems to lie in that the data in the PowerPivot tables are not stored in the Excel file as they should be. So, on reopening the file, there is no database for PP to use! I assume (hope?) Microsoft will fix this incompatibility soon.

A day after returning from the Summit, I ran into a contest organized by Microsoft on creative uses of PowerPivot. I scrambled to put together an entry since the contest closed in 2 days. First, I had to install PowerPivot. Then, I decided to use census data as the source for some kind of analysis. It took me quite a while to figure out how to get that data. Then, I analyzed the data set with PP. Finally, I decided to marry the “%change in population from one census to the next” with a solution I had posted on my web site to conditionally color shapes in an Excel worksheet (http://www.tushar-mehta.com/excel/charts/0301-dashboard-conditional%20shape%20colors.htm).

I will post more detailed reports on my experience with PP and the census data. Part 2 will be about my experience getting the census data. Part 3 will be about the add-in to conditionally color shapes, and Part 4 will be about using PP in a relatively basic manner while integrating the result with the add-in.

For now, here is an image representing the result.

powerpivot

For those curious about the contest (now closed), here is a link: http://wildfireapp.com/website/6/contests/19877/voteable_entries/3165298

Changing names of some files in a directory

This started off by helping myself to Dick’s code in ‘Removing Spaces from File Names’ (http://www.dailydoseofexcel.com/archives/2009/11/12/removing-spaces-from-file-names/) but quickly evolved to meet my own needs.

First, was the requirement to replace a certain text string by another. I added 3 parameters to the subroutine (Dirname, ReplaceWhat, and ReplaceBy). Also evident was that some files had leading spaces as well as multiple consecutive embedded blanks. I added an optional Boolean doTrim.

By using Dir rather than FileSystemObject I could restrict the returned file names to those that matched the search criteria. Consequently, it was certain that the file name would change.

Option Explicit
    Function addPathSeparator(ByVal DirName As String)
        Dim PS As String: PS = Application.PathSeparator
        If Right(DirName, Len(PS)) <> PS Then _
            DirName = DirName & PS
        addPathSeparator = DirName
        End Function
Sub FilenameReplace(ByVal DirName As String, ByVal ReplaceWhat As String, _
        ByVal ReplaceBy As String, Optional ByVal doTrim As Boolean = False)
    Dim CurrName As String
    DirName = addPathSeparator(DirName)
    CurrName = Dir(DirName & "*" & ReplaceWhat & "*")
    Do While CurrName <> ""
        Dim NewName As String
        NewName = Replace(CurrName, ReplaceWhat, ReplaceBy)
        If doTrim Then NewName = Application.WorksheetFunction.Trim(NewName)
            'VBA Trim leaves embedded multiple spaces alone; _
             Excel's TRIM changes them to a single space
        On Error GoTo Catch1
        Name DirName & CurrName As DirName & NewName
        GoTo Finally1
Catch1:
        Debug.Print "Error changing '" _
                & CurrName & "' to '" & NewName & "'" & vbNewLine _
            & "    Error: " & Err.Description _
                & " (" & Err.Number & ")"
        Resume Finally1
Finally1:
        CurrName = Dir()
        Loop

End Sub

Then, I found some files had characters just before the text to be replaced that were “special characters.” I added an optional boolean useRegExp together with the code to use a regular expression to do the cleaning.

So, a filename like ‘This is a file, change me.xls’ should become ‘This is a file changed you.xls’

Unlike the above code, Dir could not be used to restrict the filenames since it does not support regular expressions. Consequently, I included a test to ensure that the new name differed from the old name before using the Name statement to rename the file.

The code below has been lightly tested as in it worked for the few directories that I had to process, each with a different set of rules.

Option Explicit
#Const EarlyBind = False

Sub FilenameReplaceRegExp(ByVal DirName As String, _
        ByVal ReplaceWhat As String, ByVal ReplaceBy As String, _
        Optional ByVal doTrim As Boolean = False, _
        Optional useRegExp As Boolean = False)

    #If EarlyBind Then
    Dim RE As RegExp
    #Else
    Dim RE As Object
        #End If
    If useRegExp Then
        #If EarlyBind Then
        Set RE = New RegExp
        #Else
        Set RE = CreateObject("VBScript.RegExp")
            #End If
        RE.IgnoreCase = True
        RE.Global = True
        RE.Pattern = ReplaceWhat
        End If
    Dim CurrName As String
    DirName = addPathSeparator(DirName)
    CurrName = Dir(DirName & "*.*")
    Do While CurrName <> ""
        Dim NewName As String
        If useRegExp Then
            NewName = RE.Replace(CurrName, ReplaceBy)
        Else
            NewName = Replace(CurrName, ReplaceWhat, ReplaceBy)
            End If
        If doTrim Then _
            NewName = Application.WorksheetFunction.Trim(NewName)
            'VBA Trim leaves embedded multiple spaces alone; _
             Excel's TRIM changes them to a single space
        If NewName <> CurrName Then
            On Error GoTo Catch1
            Name DirName & CurrName As DirName & NewName
            GoTo Finally1
Catch1:
            Debug.Print "Error changing '" _
                    & CurrName & "' to '" & NewName & "'" & vbNewLine _
                & "    Error: " & Err.Description _
                    & " (" & Err.Number & ")"
            Resume Finally1
Finally1:
            End If
        CurrName = Dir()
        Loop

    End Sub

Invoke the above subroutine as

    FilenameReplaceRegExp "c:\dir to check", "\W*change me", "changed you", True, True

Tushar Mehta

Visualizing data in Excel: How would you interpret this graphic?

Because I have no clue. It’s from an university alumni magazine. The names I’ve smudged (or tried to anyways) are the various colleges. Could this graphic be done in Excel? Yes. Should it?

ddoe-whats-this-chart

Tushar Mehta (www.tushar-mehta.com)

Visualizing data in Excel: Balancing the physical and logical dimensions of a chart

When Excel creates a chart it assigns, by default, certain physical dimensions to the chart. These physical dimensions (height and width) are independent of the logical content of the chart. So, in the case of a column chart the physical shape of the chart is the same irrespective of whether it contains 2 columns or 200. Similarly, for a XY Scatter chart, the size of the chart is independent of the range of the X or Y dimensions. For example, if we were to plot data that constitute a circle, the default chart would look like Figure 1.

ddoe-chart1
Figure 1

This is clearly misleading since the physical width of the chart is greater than the height even though both the X and Y ranges are the same (each of them goes from -30 to +50) - and the minimum and maximum of the plotted values are also the same (-20 and +40 respectively).

In reverse, we could have data that represent an ellipse but because of the physical shape of the chart look like a circle - see Figure 2.

ddoe-chart22
Figure 2

In this case, though the physical dimensions of the chart match each other, the logical X range is 2.5 times larger than the logical Y range (-30to +50 and -10 to +30 respectively).

The code below adjusts the physical height of the active chart so that the ratio of the physical dimensions match the ratio of the logical ones. The corrected version of Figure 1 is Figure 3.

ddoe-chart4
Figure 3

and the corrected version of Figure 2 is Figure 4.

ddoe-chart31
Figure 4

There are two constants that control what the code in balanceAxis does. changeAxes causes the code to change the min. and max. values of both the X and Y axis based on the values of the series in the chart. useSeriesVals results in the code using the min. and max values of the series plotted in the chart for balance calculations rather than the min. and max. values of the 2 axes. Obviously, having changeAxes as true implies useSeriesVals.

To use the below, select a chart — it must be a chart where both X and Y axis are numeric scales — and run the balanceAxis subroutine (use ALT+F8 to open the Macro dialog box) . Given the wide range of possibilities with options across charts, the wide range of charts across all the different versions of Excel, the below works where it works.

Option Explicit
Private Function MinVal(aChart As Chart, Optional CategoryVals As Boolean)
    Dim I As Integer
    If CategoryVals Then _
        MinVal = Application.WorksheetFunction.Min( _
            aChart.SeriesCollection(1).XValues) _
    Else _
        MinVal = Application.WorksheetFunction.Min( _
            aChart.SeriesCollection(1).Values)
   
    For I = 2 To aChart.SeriesCollection.Count
        If CategoryVals Then _
            MinVal = Application.WorksheetFunction.Min(MinVal, _
                aChart.SeriesCollection(I).XValues) _
        Else _
            MinVal = Application.WorksheetFunction.Min(MinVal, _
                aChart.SeriesCollection(I).Values)
        Next I
    End Function
Private Function MaxVal(aChart As Chart, Optional CategoryVals As Boolean)
    Dim I As Integer
    If CategoryVals Then _
        MaxVal = Application.WorksheetFunction.Max( _
            aChart.SeriesCollection(1).XValues) _
    Else _
        MaxVal = Application.WorksheetFunction.Max( _
            aChart.SeriesCollection(1).Values)
   
    For I = 2 To aChart.SeriesCollection.Count
        If CategoryVals Then _
            MaxVal = Application.WorksheetFunction.Max(MaxVal, _
                aChart.SeriesCollection(I).XValues) _
        Else _
            MaxVal = Application.WorksheetFunction.Max(MaxVal, _
                aChart.SeriesCollection(I).Values)
        Next I
    End Function
Private Sub getMinMaxVals(ByVal aChart As Chart, ByVal useSeriesVals As Boolean, _
        ByRef XMin As Single, ByRef XMax As Single, _
        ByRef YMin As Single, ByRef YMax As Single)
    With aChart
    If useSeriesVals Then
        YMin = MinVal(aChart): YMax = MaxVal(aChart)
        XMin = MinVal(aChart, True): XMax = MaxVal(aChart, True)
    Else
        With .Axes(xlValue)
        YMin = .MinimumScale
        YMax = .MaximumScale
            End With
        With .Axes(xlCategory)
        XMin = .MinimumScale
        XMax = .MaximumScale
            End With
        End If
        End With
    End Sub
Private Sub updateChartAxes(aChart As Chart, _
        ByVal XMin As Single, ByVal XMax As Single, _
        ByVal YMin As Single, ByVal YMax As Single)
    With aChart
    With .Axes(xlValue)
    .MinimumScale = YMin
    .MaximumScale = YMax
        End With
    With .Axes(xlCategory)
    .MinimumScale = XMin
    .MaximumScale = XMax
        End With
        End With
    End Sub
Sub balanceAxis()
    Const changeAxes As Boolean = False, _
        useSeriesVals As Boolean = False
    Dim aChart As Chart: Set aChart = ActiveChart
    Dim XMin As Single, XMax As Single, _
        YMin As Single, YMax As Single
    Dim NewXMin As Single, NewXMax As Single, _
        NewYMin As Single, NewYMax As Single
    Const cMaxTries As Integer = 3
    Dim Ratio As Single, OldRatio As Single, Tries As Integer
        'If the axes are on automatic scaling, changing the _
         physical dimension may change the min/max axis values. _
         If that happens and if the height-to-width ratio changes, _
         the code will readjust the physical dimensions.  However, _
         to avoid a perpetual loop the number of attempts is _
         limited to cMaxTries
   Tries = 0
    Do
        Tries = Tries + 1
        getMinMaxVals aChart, changeAxes Or useSeriesVals, _
            XMin, XMax, YMin, YMax
        With aChart
        If changeAxes Then updateChartAxes aChart, XMin, XMax, YMin, YMax
        Dim DesiredHeight As Single
        OldRatio = (YMax - YMin) / (XMax - XMin)
        DesiredHeight = .PlotArea.Width * OldRatio
        .ChartArea.Height = DesiredHeight _
            * .ChartArea.Height / .PlotArea.Height
            'The above retains the ratio of the heights of the _
            chartarea and plotarea
       .PlotArea.Height = DesiredHeight
        getMinMaxVals aChart, changeAxes Or useSeriesVals, _
            NewXMin, NewXMax, NewYMin, NewYMax
        Ratio = (NewYMax - NewYMin) / (NewXMax - NewXMin)
        Debug.Print "Try " & Tries & ", Old Ratio=" & OldRatio _
            & ", Plotarea=" & .PlotArea.Height / .PlotArea.Width _
            & ", Chartarea=" & .ChartArea.Height / .ChartArea.Width _
            & ", Ratio=" & Ratio
            End With
        Loop Until Tries >= cMaxTries Or Abs(OldRatio - Ratio) < 0.00000001
    Debug.Print ""
    End Sub