Semiproductive Semiconductor

Things on which I've been working

Dynamic routing of a DCHP VPN connection

One of my clients is setting up two new VPNs — one is a connection between two office locations, and the other is the ability for users to dial in from home.

So, in either of the offices, we don’t care about reaching the single-PC VPN machines, but those machines do need to be able to reach computers at both offices. This article covers the routing situation for single-PC clients.

For hardware, we chose the SonicWall TZ100 as our appliances; one in each office. They’re pretty nice — they do VPNs,firewall, Antivirus, Spam, automatic failover/load balancing. They also provide L2TP Services, which work pretty well with the Windows Native VPN client.

We will have a couple of subnets, and a couple of VPN users; that’s a somewhat more complicated IP situation than the default situations provide. Our active subnets are:
192.168.1.x (main office),
192.168.2.x (remote office), and
192.168.4.x (VPN DHCP pool).

So, on a client PC, we want to route any traffic intended for those subnets to go through the VPN, any other network traffic we want to go out through the regular internet connection.

I was quite surprised to find out how difficult this is to pull off!

I finally settled on using WSCRIPT to accomplish that. You have to create the VPN connection item yourself, but then you can run a script connect the VPN, query the system for the IP address and set the routing.

Of course, you first need to create a stock Windows VPN connection. Then you have to set the Windows VPN item to disable the “make this gateway the default”. Then, you’ll need to use a bunch of ROUTE ADD commands to manually add the routes to the routing table.

However, the catch to that is that you need the IP address that the VPN provides, so that you can tell the computer which IP address to use as the route. There is no elegant way to get this information. There are a few WMI classes that look promising: Win32_NetworkAdapterConfiguration and Win32_NetworkAdapter. However, they only return the hard-coded IP address (or 0.0.0.0 if it’s DHCP) of the adapter, so that’s not useful.

My next thought was to fetch that information out of the registry. When the connection is made, that IP address is stored in the Registry, but only in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\Tcpip\Parameters\Interfaces\{someGUID}. That key is identified with a GUID; you’d think that GUID would match up with something else in the system, but I sure couldn’t find it. I don’t mind using GUIDs in scripts, but trying to write a script that you can give to an end-user that requires them to locate and enter a GUID is a bad idea.

SO, there’s a lot of non-attractive options out there. I did find a piece of code on the web that ran IPConfig and piped it into an array, parsing it to pull out the IP addresses returned for any adapters connected to the machine. Piping the output of a utility into a string and parsing it is error-prone — who knows if Windows 8 will change out IPConfig works. But I don’t think there were any other options. I modified it somewhat to pull the IP address for a specified adapter:

  1. [edited for presentation]
  2. Function GetIPAddress_Name(AdapterName)
  3.    workfile = fso.gettempname
  4.    sh.run "%comspec% /c ipconfig > " & workfile,0,true
  5.    set ts = fso.opentextfile(workfile)
  6.    data = split(ts.readall,vbcrlf)
  7.    strIPAddress = ""
  8.    for n = 0 to ubound(data)
  9.       if instr(data(n)," adapter ") then ‘ we’re going to get the adapter name
  10.          curAdapter = mid(data(n), instr(data(n)," adapter ")+9)
  11.          if instr(curAdapter,":") then
  12.             curAdapter = left(curAdapter,instr(curAdapter,":")-1)
  13.          end if
  14.       end if
  15.       if ucase(curAdapter) = ucase(AdapterName) then
  16.          if instr(data(n),"IPv4 Address") or instr(data(n), "IP Address") then ‘after IPv6 came out, this was the title for IPv4.
  17.             parts = split(data(n),":")
  18.             if trim(parts(1)) <> "0.0.0.0" then
  19.                strIPAddress= trim(cstr(parts(1)))
  20.             end if
  21.          end if
  22.       end if
  23.    next
  24.    GetIPAddress_Name = strIPAddress
  25. End Function

That was the hard part. Now for the rest of the code.

  1. strUsername="secret"
  2. strPassword = "Secret!"
  3. ConnectionName = "PrimarySonicwall"
  4.  
  5. set sh = createobject("wscript.shell")
  6. sh.run "%comspec% /c route delete 192.168.1.0",0,true
  7. sh.run "%comspec% /c route delete 192.168.2.0",0,true
  8. sh.run "%comspec% /c route delete 192.168.4.0",0,true
  9. sh.run "%comspec% /c rasdial """ &amp; ConnectionName &amp; """ " &amp; strUsername &amp; " " &amp; strPassword,0,true
  10. myVPN_IP = GetIpAddress_Name(ConnectionName)
  11. wscript.echo ConnectionName &amp; " IP:" &amp; myVPN_IP
  12.  
  13. if myVPN_IP <> "" then
  14.    sh.run "%comspec% /c route add 192.168.1.0 mask 255.255.255.0 " &amp; myVPN_IP,0,true
  15.    sh.run "%comspec% /c route add 192.168.2.0 mask 255.255.255.0 " &amp; myVPN_IP,0,true
  16.    sh.run "%comspec% /c route add 192.168.4.0 mask 255.255.255.0 " &amp; myVPN_IP,0,true
  17.    wscript.echo "VPN connected!"
  18. else
  19.    wscript.echo "Connection Failed."
  20. end if

As you can see, it calls the “rasdial” command, which is the command that loads and runs a VPN connection. (It sure would be nice if such a command could be modified with a parameter to return the IP address, but no such luck.) Then once the connection is established, we fetch the IP address, clear the existing route table entries, and add the new ones.

And that’s it. Feel free to use this; hopefully it’ll save you a little time somewhere.

If anyone feels like adding, or pointing people to, a script to create the VPN connection in the first place, I’d be delighted to see it.

Full Source:

  1. strUsername="secret"
  2. strPassword = "Secret!"
  3. ConnectionName = "PrimarySonicwall"
  4.  
  5. set sh = createobject("wscript.shell")
  6. sh.run "%comspec% /c route delete 192.168.1.0",0,true
  7. sh.run "%comspec% /c route delete 192.168.2.0",0,true
  8. sh.run "%comspec% /c route delete 192.168.4.0",0,true
  9. sh.run "%comspec% /c rasdial """ &amp; ConnectionName &amp; """ " &amp; strUsername &amp; " " &amp; strPassword,0,true
  10. myVPN_IP = GetIpAddress_Name(ConnectionName)
  11. wscript.echo ConnectioName &amp; " IP:" &amp; myVPN_IP
  12.  
  13. if myVPN_IP <> "" then
  14.    sh.run "%comspec% /c route add 192.168.1.0 mask 255.255.255.0 " &amp; myVPN_IP,0,true
  15.    sh.run "%comspec% /c route add 192.168.2.0 mask 255.255.255.0 " &amp; myVPN_IP,0,true
  16.    sh.run "%comspec% /c route add 192.168.4.0 mask 255.255.255.0 " &amp; myVPN_IP,0,true
  17.    wscript.echo "VPN connected!"
  18. else
  19.    wscript.echo "Connection Failed."
  20. end if
  21.  
  22. Function GetIPAddress_Name(AdapterName)
  23. ‘=====
  24. ‘ Returns single of IP Address bound to AdapterName
  25. ‘ by ipconfig or winipcfg…
  26. ‘ Win98/WinNT have ipconfig (Win95 doesn’t)
  27. ‘ Win98/Win95 have winipcfg (WinNt doesn’t)
  28. ‘ Note: The PPP Adapter (Dial Up Adapter) is
  29. ‘ excluded if not connected (IP address will be 0.0.0.0)
  30. ‘ and included if it is connected.
  31. ‘=====
  32. set sh2 = createobject("wscript.shell")
  33. set fso = createobject("scripting.filesystemobject")
  34.  
  35. Set Env = sh.Environment("PROCESS")
  36.    if Env("OS") = "Windows_NT" then
  37.       workfile = fso.gettempname
  38.       sh.run "%comspec% /c ipconfig > " &amp; workfile,0,true
  39.    else
  40.       ‘winipcfg in batch mode sends output to
  41.       ‘filename winipcfg.out
  42.       workfile = "winipcfg.out"
  43.       sh.run "winipcfg /batch" ,0,true
  44.    end if
  45.    set sh2 = nothing
  46.    set ts = fso.opentextfile(workfile)
  47.    data = split(ts.readall,vbcrlf)
  48.    ts.close
  49.    set ts = nothing
  50.    fso.deletefile workfile
  51.    set fso = nothing
  52.    strIPAddress = ""
  53.    for n = 0 to ubound(data)
  54.       if instr(data(n)," adapter ") then ‘ we’re going to get the adapter name
  55.          curAdapter = mid(data(n), instr(data(n)," adapter ")+9)
  56.          if instr(curAdapter,":") then
  57.             curAdapter = left(curAdapter,instr(curAdapter,":")-1)
  58.          end if
  59.       end if
  60.       if ucase(curAdapter) = ucase(AdapterName) then
  61.          if instr(data(n),"IPv4 Address")then
  62.             parts = split(data(n),":")
  63.             if trim(parts(1)) <> "0.0.0.0" then
  64.                strIPAddress= trim(cstr(parts(1)))
  65.             end if
  66.          elseif instr(data(n),"IP Address") then
  67.             parts = split(data(n),":")
  68.             if trim(parts(1)) <> "0.0.0.0" then
  69.                strIPAddress= trim(cstr(parts(1)))
  70.             end if
  71.          end if
  72.       end if
  73.    next
  74.    GetIPAddress_Name = strIPAddress
  75. End Function

OpenCL on nVidia hardware

So one of my clients does a lot of data analysis on large, wide data sets. (Think 4000 columns and 10 million rows.) One of the thornier problems we face in processing is a variant on the Nearest-Neighbor calculation. The columns we use are all various demographic indicators of a set of people, and we generate the pythagorean distance from each person in the set to each other person. The Pythagorean distance is merely a giant hypoteneuse — for a simple triangle, H = sqrt(a² + b²). Well, for a multi-dimensional calculation (purely theoretical), H = sqrt(a² + b² + ….. + n²). Then, suppose you have John Q Public, and you know 100 different tidbits about him (age, gender, level of education, income level, marital status, kids, own/rent.)

If you can find a way to convert those attributes into meaningful numbers, you can then generate a hypoteneuse as follows:

let:

  • A be the percentile of Age;
  • B be percentile of years of education;
  • C be percentile of salary; and
  • D be percentile of how many children one has.

JQP is the index in the vector for John Q Public.  x is the index for any given other person)

	H[JQP][x] = sqrt((A[JQP] - A[x])² + (B[JQP] - B[x])² + (C[JQP] - C[x])² + (D[JQP] - D[x])²)

As you can imagine, this turns into a lot of computation – number of total row comparisons is on the order of, N², and the number of column comparisons is linear. So 10 million row comparisons comes to 50 trillion items at 4000 subtractions, squares, additions, and a square root per row.

Now, some of the details of generating the distance, and its value to the real-world problems that are being approached is debatable, but I’m just the coder, so that’s not my problem. Besides, those are details that can be tweaked once we have a number-crunching mechanism in place.

Fortunately, this also lends itself very well to parallelization. Each computation can be done completely independent of each other. You may see optimizations in grouping several thousand nearby row comparisons in a chunk of memory, but in the abstract, you could farm out any individual row to a girl in Borneo with a calculator and wait for her to finish. (This solution is not recommended as it does not scale well.)

Detailed optimizations and file-management factors aside, this is an excellent case for experimenting with GPUs as computational platforms. ATI and nVidia are major GPU platforms, and Intel claims to be working on their own GPU. We were looking at CUDA and DirectCompute, but chose OpenCL for our proof-of-concept. OpenCL is a nice compromise between CUDA’s closed architecture and DirectCompute’s overly-generic approach. OpenCL is a common layer for both ATI and nVidia, and claims to be working on a broad array of other hardware architectures.  In my experience, it’s very similar to the CUDA code; some of the keywords and compilation processes are a little different, but it’s the same general idea. I’m eager to see what, if any, headway they’ll be making on other architectures like the Cell processor and the like.

There are copious documents available (and some particularly good ones from nVidia) to describe the architecture of a GPU, but here’s the basics: You have a large chunk of general-purpose RAM (fast by PC standards, but very slow compared to the math units on the card), a handful of Logic Processors (LU) that coordinate threads, move data around, and other boring stuff, and a large number (16 per LU) of math units. These math units have 64K of ram each, run in complete parallel to other math units, and are capable of running multiple threads on each, so each task doesn’t really have 64K of ram available — that’s shared among some other number of threads running on the math unit. The LU coordinates these tasks and instructs the math units what to work on and when to do them. (Including giving them other work to do while they’re waiting on results from global memory.) There’s also a small amount (64K) of fast shared memory for each LU, so each of the math units can share some data with one another. The individual tasks that are run are called Kernels.  A Kernel is an instance of a task to perform; the idea is that there will be a lot of Kernels run, some on each math unit and managed by the LU.

For example, suppose you have 100 numbers for which you need to generate their cube root. You would create an array of 100 source numbers and a blank array for 100 result numbers. You have 1 LU and 16 math units. The LU will say “ok, math unit A, you take kernel instances 0, 16, 32, 48, 64, 80, and 96. Math unit B, you take 1, 17, 33, 49, 65, 81, and 97.” And so on until all 100 kernels have been instantiated and assigned to a math unit. Each math unit will do a few cycles of preparation, then for kernel 0, it’ll have to do the original fetch of source number 0 from global memory. It issues the fetch instruction, and now has a few hundred cycles on its hands before getting a result. (Each of the math units will be in the same situation.) So the LU will instruct them to get started on kernels 16, 17, 18, and so on. They’ll issue the fetch for that and so on, until either they’re out of new jobs to start or get their results back for kernel 0. Once they have their source numbers, they perform the math on them. When the math is done, they send the numbers back out to their destination and address the next thread that’s ready to have work done on it.

Obviously, this can get much, much more complicated, but that’s the general idea. Refer to the nVidia documentation if you want to know more.

So we wanted to build a somewhat-generalizable platform for experimentation that would let us leverage what we wrote in the future. So the first task was building the boring stuff: datafile load/save and error-reporting mechanisms. OpenCL is in C, so I had to refamiliarize myself with mallocs and the like. I just used tab-delimited files to store a series of ASCII integers between 0 and 99 as the input file, and a series of ASCII floats for the output.

The output format is going to be a triangle – there’s no need to compare row 1 to row 3 AND row 3 to row 1, so the output is (N * (N-1))/2 cells arranged in N-1 rows.

Once the I/O stuff is handled, the next step is to allocate large buffers in memory (ultimately, production-level scalability concerns will involve breaking the data into chunks that can fit into RAM on both the host computer and the GPU) for both the source and the result data, and also creating the “global memory” buffer on the GPU for the same purpose. Next, I created a general plan of attack for the Kernel. Each math unit has 16K (16384) 32-bit floating-point registers; that means for a 4000-column dataset, you can hold 4 threads’ worth of working data if you’re careful with it. (See above about multiple threads for math units)

In this case, we’ll be going from 1 to N, comparing each row’s predecessor to all the rows after it. So, here’s the code that creates the kernels and sets up their parameters:

  1.  for (size_t i = 1; i < iNumRows; i++)
  2.      {
  3.         *iNumKernels = iNumRows – i;
  4.         iNumRows = iNumRows;
  5.         baseRow = (size_t *)malloc(sizeof(size_t));
  6.         *baseRow = i – 1;
  7.         hKernel = clCreateKernel(hProgram, "rowCompareKernel", 0);
  8.         shrCheckError(ciErrNum, CL_SUCCESS);
  9.         ciErrNum = clSetKernelArg(hKernel, 0, sizeof(size_t), &amp;iNumElements); // RowWidth — number of cells per row
  10.         shrCheckError(ciErrNum, CL_SUCCESS);
  11.         ciErrNum = clSetKernelArg(hKernel, 1, sizeof(size_t), &amp;iNumColumns); // NumCols — number of cells w/data
  12.         shrCheckError(ciErrNum, CL_SUCCESS);
  13.         ciErrNum = clSetKernelArg(hKernel, 2, sizeof(size_t), (const void*) baseRow); // BaseIndex
  14.         shrCheckError(ciErrNum, CL_SUCCESS);
  15.         ciErrNum = clSetKernelArg(hKernel, 3, sizeof(size_t), &amp;iNumRows); // NumRows
  16.         shrCheckError(ciErrNum, CL_SUCCESS);
  17.         ciErrNum = clSetKernelArg(hKernel, 4, sizeof(size_t), iNumKernels); // Total number of threads we’re going to start for this iteration
  18.         shrCheckError(ciErrNum, CL_SUCCESS);
  19.         ciErrNum = clSetKernelArg(hKernel, 5, sizeof(cl_mem), (void *)&amp;InputData); // BigArray
  20.         shrCheckError(ciErrNum, CL_SUCCESS);
  21.         ciErrNum = clSetKernelArg(hKernel, 6, sizeof(cl_mem) , (void *)&amp;OutputData); // ResultsArray
  22.         shrCheckError(ciErrNum, CL_SUCCESS);
  23.         // set work-item dimensions
  24.         global_work_size[0] = *iNumKernels;
  25.         local_work_size[0]= *iNumKernels;

you can only have 512 threads in a work group. More than that, and you’ll need to create another work group. So, Global_work_size >= local_work_size, and must be a multiple of local_work_size. If there’s a remainder, we need to create the threads for all the remainders and put a “don’t do anything” in the kernel for the instances that are not needed.

  1.   if (local_work_size[0] > workgroup_size)
  2.    {
  3.       // ok, first find out how many local groups we’re going to need
  4.       size_t group_count;
  5.       size_t group_remainder;
  6.       group_count = *iNumKernels / workgroup_size;
  7.       group_remainder = *iNumKernels % workgroup_size;
  8.       if (group_remainder > 0)
  9.       {
  10.          group_count++;
  11.       }
  12.       global_work_size[0] = workgroup_size * group_count;
  13.       local_work_size[0] = workgroup_size;
  14.    }
  15.    // the last parameter is an event.  I’m not doing much with them for execution, but it’s required for the timer.
  16.    ciErrNum = clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, 0, global_work_size, local_work_size ,  0, 0, &GPUExecution[0]);
  17. }

Note the comment about work groups:
you can only have 512 threads in a work group. More than that, and you’ll need to create another work group. So, Global_work_size >= local_work_size, and must be a multiple of local_work_size. If there’s a remainder, we need to create the threads
for all the remainders and put a “don’t do anything” in the kernel for the instances that are not needed.

Similarly, there’s a performance tweak in here about “strides” — the GPU access data a little faster if all the fetches line up to some multiple of the number of math units in a thread group. So it’s slightly faster to make sure the width of your rows is some multiple of the number of math units. This means that the width of the row might not equal the number of cells. That’s why there are separate parameters for RowWidth and NumCols.

The other parameters are pretty straightforward — Base row, which is the row against which all the kernels are comparing their rows, the number of rows in the set, number of kernels in the iteration, and pointers to the beginning of the input and output arrays.

All the calls to the GPU are asynchronous, but you can fake synchronicity by making the calls blocking.

Now let’s take a look at the Kernel code. A rather odd (IMO) feature of OpenCL is that the kernel is compiled at runtime from a separate source file. That wouldn’t be my preference, but it’s not a very big deal, since the kernels aren’t very big, and the whole point of the process is to cut down on stuff that would take minutes or hours to run. So it’s acceptable, but still rather strange.

  1. __kernel void rowCompareKernel(int RowWidth, int NumCols, int BaseIndex, int NumRows, int WorkSize, __global const float* BigArray, __global float* ResultsArray)
  2. }
  3.  

the __global prefix means that it’s global memory on the video card.

NumCols doesn’t necessarily equal RowWidth — RowWidth is rounded up to a size that optimizes memory fetch — usually a multiple of 32 or something.  NumCols is the number of columns that actually contains data.

  1.   __local float BaseRow[1024];
  2.    float accumulator;
  3.    float tempVal;
  4.    float tempVal2;
  5.    uint RowToCompare;
  6.    uint myID;
  7.    // ok, i’m going to cheat and move the base row to shared memory for thread 0.

What I’m doing is having only thread #0 move a copy of the base row to shared memory. This shared memory is very fast and is accessible from all the kernels, so we might as well use it. It seems rather strange that the only way to do this is have one Kernel do the move and have the rest sit around, but that seems to be the case. We could streamline this somewhat by having each thread move a few bytes, if that helps.

  1.  
  2.    if (get_local_id(0) == 0)
  3.    {
  4.       for (int Y = 0; Y < RowWidth; Y++)
  5.       {
  6.           BaseRow[Y] = BigArray[(RowWidth * BaseIndex)+Y];
  7.       }
  8.    }
  9.    barrier(CLK_LOCAL_MEM_FENCE);
  10.  

This barrier call forces the Logic Unit to wait till all the kernels have reached this point (obviously, thread 0 will be the critical path on this one) before any of them progresses beyond this point. Since all the kernels will be depending on the data move above, we need them to wait.

  1.   myID = get_global_id(0);
  2.    if (myID <= WorkSize)
  3.    {
  4.       RowToCompare = BaseIndex + 1 + myID;
  5.       accumulator = 0 ;
  6.       for (int x = 0; x < NumCols; x++)
  7.       {
  8.          tempVal = BigArray[(RowToCompare * RowWidth) + x] – BaseRow[x];
  9.          tempVal2 = tempVal * tempVal;
  10.          accumulator += tempVal2;
  11.       }

As I stated before, the output array will not be rectangular, so we have to calculate the proper place to put the result. We could be
inefficient in our use of memory space, but we’d run into problems with that eventually, so we might as well do it right.

  1.   int SumOneN;
  2.    int SumOneB;
  3.    int MaxIndex = (NumRows -1);
  4.    uint ResultsOffset;
  5.    uint iTemp = 0;
  6.    for (int i = 1; i <= BaseIndex; i++)
  7.    {
  8.       iTemp += (NumRows – i);
  9.    }
  10.    ResultsOffset = iTemp;
  11.    float TempValTwo;
  12.    TempValTwo = sqrt(accumulator);
  13.    ResultsArray[ResultsOffset + myID] = TempValTwo;
  14.    }
  15. }

…And that’s it. The performance of this was not very good, frankly; I was able to get about a 15% increase over a CPU-based version of the same calculation with a 1000×1000 input array. The bulk of the computation gains were eaten up by the overhead of compiling the kernel and moving the data to/from the GPU. Unfortunately, when I tried to scale it up, I ran into memory problems that were outside the scope of the proof-of-concept project.

Further efforts on this will focus on scalability issues and more efficient data-management approaches.

So what’s this blog thing about?

Well, i’m gonna give it a shot.  We’ll see how it works out.