Deep inside CPU: Raw multicore programming


The Infamous Trilogy: Part 3

In my previous two gory articles about CPU internals and virtualization I explained various CPU internals, but not how multiprocessors actually work. Here's a small working (if dirty) piece of code that will help you catch up with multicore processing.

In my next "Low Level M3ss" article I use this to implement a DOS multicore interface.

4x1 = 1x4

Basically, it's simple. Each of the CPUs has its own set of registers and modes. Only the memory is shared between them. That means that, in order to put the 8 cores of an i7 into long mode, we have to execute the very same procedure for each of the cores, because each core has its own register set, GDT, LDT etc. Therefore, we are able to start a CPU in real mode and keep it there, while directing another CPU to long mode.

The same occurs in virtualization. In my Virtualization article I explain how to put the CPU in this state; in order to put the entire machine in virtualization, each one of the CPUs must be directed into it, which means setting up the VMCS, VMX regions etc. for all of them.

As you already know, the CPU starts executing at the reset vector (F000:FFF0 in real-mode terms), but this is only true for the first CPU. All other CPUs stay "asleep" in a special state called Wait-for-SIPI until woken up. The main CPU wakes the other CPUs by sending a SIPI (Startup Inter-Processor Interrupt), which contains the startup address for that CPU. Later on, other Inter-Processor Interrupts are used to communicate between the CPUs.

Therefore, in a really weird driver I'm creating, I should be able to take one processor from Windows, get it back to real mode, then virtualize all the remaining CPUs to create an inside debugger. Ha! Not so easy, eh.

Preparing the Party

Because multicore programming has nothing to do with CPU modes, you can do it in any mode. You could even do it in real mode, except that the memory we need to access is above 1MB, which means that you have to enter protected mode (or unreal mode). You have the following options:

  • Use FASM, as in my other articles, to do the entire thing in assembly in either protected mode or long mode. Good luck. This is what I use in my next article.
  • Use unreal mode and work in Turbo C++ and TASM. Turbo C++ is quite an old C++ compiler (no templates), but you actually just need some plain C. Special ASM routines are used to put the CPU in unreal mode and to read/write the memory over 1MB. That is what I use in this article.
  • Use DJGPP for DOS, which produces DPMI executables. This uses GCC and it has a far-address model which you can neutralize to 32-bit linear addresses by calling __djgpp_nearptr_enable() and adding the value of __djgpp_conventional_base to any pointer. However, this is only valid for addresses under 1MB and it doesn't work (at least on my testing machine) for higher addresses. If you have any luck with it, then it will be easier, since you can do the entire thing in C.
  • Use Bochs, but this code wants ACPI 2.0 or later and Bochs provides 1.0, so a slight modification is needed.

You need a DOS setup. FreeDOS works as long as no 386 memory manager is installed, because once the CPU is in VM86 mode we can no longer put it in unreal mode. However, Turbo C++, as I've found, only works when a 386 memory manager is installed, so it is a nasty thing for me to keep rebooting the VM, once to compile and once to test.

VMware and VirtualBox will work (as long as your CPU supports unrestricted guest in VMs), and I strongly suggest setting up a VM instead of working in plain DOS: I simply edit the sources in Windows and pass them to the VM through the VMware shared folders for DOS (check a tool named vmsmount, which mounts VMware shared folders in DOS).

The easiest thing to work with is DOSBox. Install Turbo C++ and Turbo Assembler there, then mount the directory so you can edit the sources in a Windows editor. Then you simply run the executables in the VM (you can't run them in DOSBox, since DOSBox does not expose ACPI). This approach is way faster than installing Turbo C++ inside the VM, for actual DOS is awfully slow in compilation.

So, the best setup for me (and possibly for you) is:

  • Create your sources in a Windows editor, say, Notepad++.
  • Install DOSBox and mount directories to view these sources.
  • Install TC++ and TASM in DOSBox.
  • Edit sources in Notepad++, then compile in DOSBox.
  • Go to a VM, where you've also mounted the same directory, and run the executable under FreeDOS.


All this stuff is done by the APIC (Advanced Programmable Interrupt Controller). To the programmer it is basically a set of memory-mapped registers: the controller reacts to what we write at their memory offsets. You find out more about the APIC by searching the ACPI (Advanced Configuration and Power Interface) tables. After verifying that we actually have an APIC somewhere (CPUID with EAX = 1, then check EDX bit 9), the first thing we must do is find where the ACPI root structure is in memory. It is in one of the following locations:

  • In the Extended BIOS Data Area, whose real-mode segment is stored at memory address 0x040E (I've never seen it there myself).
  • In BIOS memory somewhere between physical 0xE0000 and 0xFFFFF .
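The two small checks mentioned so far can be sketched in portable C. This is only an illustration: has_apic and segment_to_physical are hypothetical names, and obtaining the CPUID result or reading the word at 0x040E is assumed to happen elsewhere (in the article's setup, through assembly helpers).

```c
#include <stdint.h>

/* cpuid1_edx is assumed to hold the EDX value returned by CPUID leaf 1;
   how it is obtained (inline assembly, compiler intrinsic) is left out. */
int has_apic(uint32_t cpuid1_edx)
{
    return (int)((cpuid1_edx >> 9) & 1u);   /* EDX bit 9 = on-chip APIC */
}

/* A real-mode segment value, such as the word stored at 0x040E,
   corresponds to the physical address segment * 16. */
uint32_t segment_to_physical(uint16_t segment)
{
    return (uint32_t)segment << 4;
}
```

For example, a segment value of 0x9FC0 read from 0x040E would mean the search starts at physical address 0x9FC00.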

Searching for the ACPI, we will locate it by its 8-byte signature 0x2052545020445352 (the ASCII string "RSD PTR " read as a little-endian number). If this signature is not found in memory, then we don't have ACPI and, as far as this code is concerned, there are no multiple CPU cores.
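The signature search can be sketched as follows. It is a minimal illustration, assuming mem is a hypothetical buffer holding a copy of the memory area being searched; the function names are made up for this example. The ACPI specification places the structure on a 16-byte boundary, so only those offsets are tested.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Packs 8 signature bytes into a 64-bit value, first byte lowest,
   which is how a little-endian CPU would read them from memory. */
uint64_t signature_as_u64(const char sig[8])
{
    uint64_t v = 0;
    int i;
    for (i = 7; i >= 0; i--)
        v = (v << 8) | (unsigned char)sig[i];
    return v;
}

/* Returns the offset of the first "RSD PTR " signature found on a
   16-byte boundary inside mem, or -1 if there is none. */
long find_rsdp(const unsigned char *mem, size_t size)
{
    size_t off;
    for (off = 0; off + 8 <= size; off += 16) {
        if (memcmp(mem + off, "RSD PTR ", 8) == 0)
            return (long)off;
    }
    return -1;
}
```

A hit is only a candidate: the checksum still has to be verified before the structure is trusted.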

As the RSDP (Root System Description Pointer) documentation states, this is merely the signature of a larger structure. We might have ACPI 1.0 or ACPI 2.0 and we will save the structure data for further use. Each ACPI table has a checksum, and the sum of all the bytes in an ACPI table must be a value whose lower byte is zero:

// Yes I indent braces. Sue me :P
int ChecksumValid(unsigned char* addr,int cnt)
    {
    unsigned long a1 = 0;
    for(int i = 0 ; i < cnt ; i++)
        a1 += *(addr + i);
    if((a1 & 0xFF) == 0)
        return 1;
    return 0;
    }
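As an illustration of how the checksum applies to the ACPI 2.0 root structure, here is a hedged sketch in portable C. The byte offsets follow the RSDP layout of the ACPI specification (revision at 15, length at 20, 64-bit table address at 24); the function names are hypothetical and rsdp is assumed to point at a copy of the structure.

```c
#include <stdint.h>

/* Returns 1 when the low byte of the sum of cnt bytes is zero. */
int acpi_sum_ok(const unsigned char *p, uint32_t cnt)
{
    unsigned sum = 0;
    uint32_t i;
    for (i = 0; i < cnt; i++)
        sum += p[i];
    return (sum & 0xFFu) == 0;
}

/* Little-endian read of n bytes (n <= 8). */
uint64_t rd_le(const unsigned char *p, int n)
{
    uint64_t v = 0;
    while (n-- > 0)
        v = (v << 8) | p[n];
    return v;
}

/* Extracts the 64-bit table address from an ACPI 2.0+ RSDP.
   Returns 1 on success, 0 for an older revision or a bad checksum. */
int rsdp_xsdt_address(const unsigned char *rsdp, uint64_t *out)
{
    uint32_t length;
    if (rsdp[15] < 2)                /* Revision: 0 means ACPI 1.0 */
        return 0;
    if (!acpi_sum_ok(rsdp, 20))      /* checksum over the ACPI 1.0 part */
        return 0;
    length = (uint32_t)rd_le(rsdp + 20, 4);
    if (length < 36 || !acpi_sum_ok(rsdp, length))
        return 0;                    /* extended checksum over everything */
    *out = rd_le(rsdp + 24, 8);
    return 1;
}
```

The first 20 bytes are checked separately because an ACPI 2.0 structure carries two checksums: one over the original 1.0 fields and one over the full length.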

Having found the RSDP, we take the address of the starting ACPI table in memory from its fields. Note that, for simplicity, I only mess with ACPI 2.0 or newer, where the RSDP is actually an RSDPDescriptor20 structure that contains a 64-bit physical address of the starting ACPI table. This physical address is above 1MB (it is a 64-bit address, but it is always in the lower 4GB area to allow 32-bit systems to work) and hence it is only accessible from protected mode or, in our little program, from unreal mode. From there we reach a chain of tables, all of which begin with the basic structure ACPISDTHeader, which contains the length of each table in bytes. There are many ACPI tables and we are only interested in a few of them.


struct ACPISDTHeader
  {
  char Signature[4];
  unsigned long Length;
  unsigned char Revision;
  unsigned char Checksum;
  char OEMID[6];
  char OEMTableID[8];
  unsigned long OEMRevision;
  unsigned long CreatorID;
  unsigned long CreatorRevision;
  };

All ACPI tables start with this structure as a header, and the Length member tells us the total number of bytes that the structure has, so we can find the next structure in the memory until the Length is 0 (or the checksum is invalid).
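The walk described above can be sketched like this. It is an illustration only: buf is a hypothetical buffer holding tables stored back to back, find_table is a made-up name, and the 36-byte header size is simply the sum of the ACPISDTHeader fields.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Little-endian 32-bit read, independent of the host byte order. */
uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Steps from table to table using the Length field (offset 4) and
   returns the offset of the first table whose 4-byte signature
   matches sig, or -1. A zero or implausible Length ends the walk. */
long find_table(const unsigned char *buf, size_t size, const char *sig)
{
    size_t off = 0;
    while (off + 36 <= size) {
        uint32_t len = rd32(buf + off + 4);
        if (len < 36 || len > size - off)
            break;
        if (memcmp(buf + off, sig, 4) == 0)
            return (long)off;
        off += len;
    }
    return -1;
}
```

Calling find_table(buf, size, "APIC") would locate the MADT, whose table signature is "APIC".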


How many CPUs do I have?

This is the easy part. We have to find the MADT (its table signature is "APIC") in memory and pass it to DumpMadt, which prints "Local Processor" twice here because my virtual machine is configured with 2 CPU cores. Note that the MADT also informs us of the Local APIC address (by default at physical address 0xFEE00000).

Each CPU has its own Local APIC, which handles interrupts for that CPU. It contains various stuff, such as a Local Vector Table (LVT), which translates local interrupts (such as the clock) into actual interrupt vectors. There is also an I/O APIC, which provides multiprocessor interrupt management. The MADT also tells us the address of the I/O APIC, which is by default at physical address 0xFEC00000. The Local APIC location can be changed by setting an MSR (IA32_APIC_BASE), but in our program we will leave both addresses at their default values.

Note that the CPU does not know how much memory you have. Even if you only have 4MB of RAM, the Local APIC is still at physical address 0xFEE00000.

Examining the MADT will give us all the above information:


void DumpMadt(char* pmadt)
    {
    ACPISDTHeader* madt = (ACPISDTHeader*)pmadt;
    int le = (int)madt->Length;
    myprintf("\tMADT Length: %d\r\n",le);
    // Save Local APIC Address
    char* a0 = pmadt + 0x24;
    LocalControllerAddress = *(unsigned long*)a0;
    myprintf("\tMADT Local APIC Address: %lX\r\n",LocalControllerAddress);
    char* a1 = pmadt + 0x2C; // Go to variable table entries
    le -= 0x2C;
    for(; le > 0 ;)
        {
        char Ty = a1[0]; // Entry type
        char Le = a1[1]; // Entry length in bytes
        if (Le <= 0)
            break; // A malformed entry would never advance the loop
        if (Ty == 0)
            {
            // Processor Local APIC entry
            A_CPU c;
            c.AcpiID = (char)a1[2];
            c.ApicID = (char)a1[3];
            c.flags = *(unsigned long*)(a1 + 4);
            cpus[TotalCPUS++] = c;
            myprintf("\tMADT Entry Type: %i Local Processor with ACPI ID %d, APIC ID %d, Flags %ld\r\n",Ty,(char)a1[2],(char)a1[3],*(unsigned long*)(a1 + 4));
            }
        if (Ty == 1)
            {
            // I/O APIC entry
            myprintf("\tMADT Entry Type: %i I/O APIC with APIC ID %d, I/O APIC address %lX and Base %lX\r\n",Ty,(char)a1[2],*(unsigned long*)(a1 + 4),*(unsigned long*)(a1 + 8));
            IOAPIC = *(unsigned long*)(a1 + 4);
            }
        if (Ty == 2)
            {
            // Interrupt Source Override entry
            myprintf("\tMADT Entry Type: %i ISO \r\n",Ty);
            }
        le -= Le;
        a1 += Le;
        }
    }
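The same entry walk can be written as a small, portable function that can be exercised on a synthetic table. This is a sketch with hypothetical names, not part of the article's sources; it additionally looks at bit 0 of the Flags field, which the ACPI specification defines as "processor enabled".

```c
#include <stdint.h>

/* Counts the type-0 (processor Local APIC) entries whose Flags bit 0
   is set. madt points at the start of the table and length is its
   Length field; the variable entries begin at offset 0x2C. */
int count_enabled_cpus(const unsigned char *madt, uint32_t length)
{
    uint32_t off = 0x2C;
    int count = 0;
    while (off + 2 <= length) {
        unsigned type = madt[off];
        unsigned len  = madt[off + 1];
        if (len < 2 || len > length - off)
            break;                       /* malformed entry: stop */
        if (type == 0 && len >= 8 && (madt[off + 4] & 1u))
            count++;                     /* Flags start at entry offset 4 */
        off += len;
    }
    return count;
}
```

On the two-core virtual machine described above, a function like this would be expected to report 2, provided both entries are marked enabled.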


Configuring the Local APIC

To prepare the APIC to manage interrupts, we have to enable it through the "Spurious Interrupt Vector Register" at offset 0xF0:

  • Write32(Addr + 0xF0, 0x1FF); // Bit 0-7 is the interrupt vector, Bit 8 means "Software Enable APIC"
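To show where the value 0x1FF comes from, here is a minimal sketch. The helper names are hypothetical, and apic_write32 assumes the Local APIC registers are reachable through an ordinary pointer, which is not the case in the article's unreal-mode setup (there, Write32 goes through assembly helpers).

```c
#include <stdint.h>

/* Builds the Spurious Interrupt Vector Register value:
   bits 0-7 hold the vector, bit 8 is the software enable flag. */
uint32_t spurious_vector_value(uint8_t vector, int enable)
{
    return (uint32_t)vector | (enable ? (1u << 8) : 0u);
}

/* Stores a 32-bit value at a byte offset from the register base.
   The registers are 32 bits wide, so the array index is offset / 4. */
void apic_write32(volatile uint32_t *base, uint32_t offset, uint32_t value)
{
    base[offset / 4] = value;
}
```

With vector 0xFF and the enable bit set, spurious_vector_value returns the 0x1FF used above.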

After that, we are ready to send IPIs. An IPI (Interprocessor Interrupt) is sent by using the Interrupt Command Register of the Local APIC. This consists of two 32-bit registers, one at offset 0x300 and one at offset 0x310 (All Local APIC registers are aligned to 16 bytes): 

  • The register at 0x310 is the one we write first; it contains the Local APIC ID of the processor we want to send the interrupt to, in bits 24 - 31 (only bits 24 - 27 on older processors).
  • The register at 0x300 has the following structure:
struct R300
    {
    unsigned char VectorNumber; // Starting page for SIPI
    unsigned char DestinationMode:3; // "Delivery Mode" in Intel's manuals: 0 normal, 1 low, 2 SMI, 4 NMI, 5 Init, 6 SIPI
    unsigned char DestinationModeType:1; // 0 for physical 1 for logical
    unsigned char DeliveryStatus:1; // 0 - message delivered
    unsigned char R1:1;
    unsigned char InitDeAssertClear:1; // Bit 14, "Level" in Intel's manuals
    unsigned char InitDeAssertSet:1; // Bit 15, "Trigger Mode" in Intel's manuals
    unsigned char R2:2;
    unsigned char DestinationType:2; // 0 normal, 1 send to me, 2 send to all, 3 send to all except me
    unsigned char R3:12;
    };

Writing to register 0x300 actually sends the IPI (that is why you must write to 0x310 first). Note that if DestinationType is not 0, the destination target in register 0x310 is ignored. Under Windows, IPIs are sent at IRQL 29.
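Since bit-field layout depends on the compiler, the same two registers can also be composed with shifts. The following is a sketch with hypothetical function names; the bit positions are those of the structure above (vector in bits 0-7, mode in bits 8-10, level in bit 14, DestinationType in bits 18-19).

```c
#include <stdint.h>

/* Composes the low dword of the Interrupt Command Register (0x300). */
uint32_t icr_low(uint8_t vector, unsigned delivery_mode,
                 int level_assert, unsigned shorthand)
{
    return (uint32_t)vector
         | ((uint32_t)(delivery_mode & 7u) << 8)
         | ((uint32_t)(level_assert ? 1u : 0u) << 14)
         | ((uint32_t)(shorthand & 3u) << 18);
}

/* Composes the high dword (0x310): target APIC ID in the top byte. */
uint32_t icr_high(uint8_t apic_id)
{
    return (uint32_t)apic_id << 24;
}
```

With these helpers, an Init IPI with the level bit set comes out as 0x4500, and a SIPI for starting page 0x80 as 0x4680.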

To wake the processor, we send two special IPIs. The first is the "Init" IPI, DestinationMode 5, which resets the target CPU and leaves it in the Wait-for-SIPI state. The second IPI is the SIPI, DestinationMode 6, which starts the CPU and carries its starting address. Remember that the CPU starts in real mode, so we have to give it a real-mode memory address: VectorNumber holds the starting page, and execution begins at physical address VectorNumber * 4096. By convention, 2 SIPIs are sent with a delay between them.

Because the starting address must be aligned to 4096 bytes (the SIPI vector is a page number, so 0x80 means physical address 0x80000), my code copies the startup code from the ASM source to the hardcoded address 0x80000 as a quick solution to that.
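The whole wake-up sequence can be written down as a list of register writes. This is only a sketch: the structure and function names are hypothetical, and the 10 ms and 200 microsecond pauses are the values commonly used for this sequence, not something taken from the article's sources. The caller is assumed to perform each write on the Local APIC and then wait for the given time.

```c
#include <stdint.h>

struct apic_write {
    uint32_t offset;    /* Local APIC register offset */
    uint32_t value;     /* value to store there */
    unsigned delay_us;  /* pause after the write, in microseconds */
};

/* Converts a start address into a SIPI vector (a page number).
   Returns -1 if the address is not 4096-aligned or not below 1MB. */
int sipi_vector(uint32_t address)
{
    if ((address & 0xFFFu) != 0 || address >= 0x100000u)
        return -1;
    return (int)(address >> 12);
}

static void put(struct apic_write *w, uint32_t off, uint32_t val, unsigned us)
{
    w->offset = off;
    w->value = val;
    w->delay_us = us;
}

/* Fills out[] with one Init IPI followed by two SIPIs for the CPU
   with the given APIC ID. Returns the number of writes, or 0 if the
   start address cannot be expressed as a SIPI vector. */
int build_wake_sequence(uint8_t apic_id, uint32_t start_address,
                        struct apic_write out[6])
{
    int vector = sipi_vector(start_address);
    uint32_t target = (uint32_t)apic_id << 24;
    int n = 0, i;
    if (vector < 0)
        return 0;
    put(&out[n++], 0x310, target, 0);
    put(&out[n++], 0x300, 0x00004500u, 10000);                 /* Init */
    for (i = 0; i < 2; i++) {
        put(&out[n++], 0x310, target, 0);
        put(&out[n++], 0x300, 0x00004600u | (uint32_t)vector, 200); /* SIPI */
    }
    return n;
}
```

Keeping the sequence as data separates the protocol from the way the registers are reached, which in unreal mode means going through the assembly read/write helpers.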

Finally, you need to write the value 0 to the "End of Interrupt" register (Local APIC + 0xB0) to indicate that you can send another interrupt.

Assembly tricks

In the ASM source, you can find some assembly tricks. Apart from the EnterUnreal routine, there are set/get functions for memory in unreal mode. Following is the ReadX function, which reads data from a 32-bit physical address. The function is called with 3 parameters: an unsigned long for the physical address, a word for the byte count and a far pointer to the memory to be written. In this code I use the FS register to move memory from fs:[esi] to es:[di]. Yes, I could use REP MOVSB if I had used DS instead, but I am not sure if and how Turbo C++ messes with DS, and if DS loses the "unreal" feature the code will fail. Therefore I decided to implement it with FS, which is less likely to be touched by Turbo C++. If you do it in assembly, you will of course implement it with DS.

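    PUSH BP   ; Why do we push BP anyway. Is it elsewhere used except as a stack pointer?
    MOV BP,SP ; (restored, assumed) frame pointer, so that [BP + n] reaches the parameters
    PUSH ES   ; (restored, assumed) balances the POP ES below
    ; Read from Address
    MOV ESI,[BP + 6]
    ; Count
    MOV CX,[BP + 10]
    ; Far pointer to store result
    MOV DI,[BP + 12]
    MOV DX,[BP + 14]
    MOV ES,DX ; (restored, assumed) segment part of the far pointer
    ; We read from FS:[ESI]
    ; And store to ES:[DI]
    L1R:
        or cx,cx
        jz L1REnd
        mov al,fs:[esi]
        mov es:[di],al
        inc esi
        inc di
        dec cx
        jmp L1R
    L1REnd:
    POP ES
    POP BP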

The CPU core will start at an EntryPoint1 function:


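    INC [gs:_CHECK_BYTE1]
    ap_1: ; (label restored; it is the target of the jumps below)
    CMP [gs:_CHECK_BYTE2],1
    JNE ap_1
    HLT ; (restored, assumed position; the text says the core ends with HLT)
    JMP ap_1 ; In case of a NMI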


It increments a shared counter (which is checked by the C program) and then polls a second flag; once that flag is set, it stops (with opcode HLT).


Thread Safety

Ha! As you can guess, no DOS function is thread safe. That means that, to call DOS from other CPUs you must perform proper synchronization. Hey, should I create a DPMI version 2.0 which provides both 64-bit support and multicore interfaces? :) Yes I did it.
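As an illustration of what "proper synchronization" could look like, here is a minimal spinlock sketch in modern C using C11 atomics. It is not the article's code: the names are hypothetical, and Turbo C++ has no <stdatomic.h>, so in the DOS setup an assembly routine built on an atomic instruction such as XCHG would have to play this role.

```c
#include <stdatomic.h>

/* One lock guarding every call into DOS (hypothetical name). */
typedef struct {
    atomic_flag busy;
} dos_lock;

void dos_lock_init(dos_lock *l)
{
    atomic_flag_clear(&l->busy);
}

/* Returns 1 if the lock was taken, 0 if another CPU holds it. */
int dos_lock_try(dos_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->busy,
                                              memory_order_acquire);
}

/* Spins until the lock becomes free, then takes it. */
void dos_lock_acquire(dos_lock *l)
{
    while (!dos_lock_try(l)) {
        /* busy wait: another CPU is inside DOS */
    }
}

void dos_lock_release(dos_lock *l)
{
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}
```

Each CPU would call dos_lock_acquire before issuing a DOS call and dos_lock_release afterwards, so that only one CPU is inside DOS at a time.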


It's not very tough, as you saw. The problem is synchronizing the whole thing. More in the next article!



27 - 03 - 2015 : Some typos and INIT IPI.
25 - 03 - 2015 : First Release