Before we can run a C/C++ program, we have to compile our source code into an executable. What really happens when we compile with gcc or clang, or simply click "run" in our IDE?
Preprocessor
First of all, the C preprocessor (cpp) runs through our source file and:
- Removes comments
- Expands macros: Replaces macro calls with their corresponding values, as defined by #define SOMETHING some_value directives.
- Expands included files: Processes #include directives by including the contents of specified files at the specified locations in the code.
- Evaluates conditional compilation steps: Handles directives like #if, #ifdef, etc., to conditionally include or exclude portions of code during compilation.
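As a rough sketch of what these steps mean in practice, the snippet below combines an include, a macro and a conditional block. The names GREETING, VERBOSE, SUFFIX_LENGTH and greeting_length are made up for this illustration and are not part of the examples in this article.

```c
#include <string.h>   /* the preprocessor pastes the contents of string.h here */

#define GREETING "hello"   /* every later GREETING token becomes "hello" */

/* Hypothetical switch: compiling with -DVERBOSE selects the first branch. */
#ifdef VERBOSE
#define SUFFIX_LENGTH 1    /* pretend verbose builds append a '!' */
#else
#define SUFFIX_LENGTH 0
#endif

/* Without -DVERBOSE, the compiler sees: return strlen("hello") + 0; */
size_t greeting_length(void) {
    return strlen(GREETING) + SUFFIX_LENGTH;
}
```

Running cpp on such a file, once as is and once with -DVERBOSE, should show the two different expansions.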
Let's see how it all works.
This is our source code file nothing.c that does, well, nothing at all:
main() {
int x = 42;
}
Let's call the preprocessor:
cpp nothing.c > nothing.i
And here are the contents of the intermediate file nothing.i:
# 0 "nothing.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "nothing.c"
main() {
int x = 42;
}
Linemarkers
The added lines with the format # (line number) (filename) (flags), as described in the official GCC documentation, are called linemarkers. They record which file and line of the original source each part of the preprocessed output came from.
The possible flags are:
- 1: A new file is included starting from this line.
- 2: We are done including the file and we return to the previous file.
- 3: The text that follows comes from a system header file and should be treated as such (for example, certain warnings are suppressed).
- 4: The text that follows should be treated as if it were wrapped in an implicit extern "C" block. In C++, many functions can have the same name (function overloading), so the compiler changes (mangles) the names of the functions. This flag tells the compiler explicitly that the names declared in this file should not be mangled.
Now let's see the linemarkers in our code:
- # 0 "nothing.c": Start reading the file nothing.c from line 0.
- # 0 "<built-in>": Read the built-in macros predefined by the compiler.
- # 0 "<command-line>": Read command line options that should affect preprocessing.
- # 1 "/usr/include/stdc-predef.h" 1 3 4: Include a new file (flag 1) from location /usr/include/stdc-predef.h, which is a system header file (flag 3) and should not have its symbols mangled (flag 4).
- # 0 "<command-line>" 2: Done including the previous file (flag 2), back to command-line.
- # 1 "nothing.c": We don't have any files that the code specifically asks us to include. Also, we are done reading compiler macros, command line options and including system header files. So let's start reading line 1 of our nothing.c code.
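The compiler uses this bookkeeping to report errors against the original file and line rather than the preprocessed one. The same information is visible from C through the standard __FILE__ and __LINE__ macros, and the standard #line directive overrides it much like a linemarker does. A minimal sketch, where the function names and the file name "renamed.c" are invented for the example:

```c
#include <string.h>

/* __LINE__ expands to the current line number as the preprocessor counts it. */
int line_before(void) { return __LINE__; }

/* #line resets the bookkeeping: the line right after the directive is
   treated as line 100 of a file called "renamed.c" (a hypothetical name). */
#line 100 "renamed.c"
int line_after(void) { return __LINE__; }   /* now counted as line 100 */

/* __FILE__ reflects the overridden name as well. */
int file_was_renamed(void) {
    return strcmp(__FILE__, "renamed.c") == 0;
}
```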
Comments and macros
Let's try a different, but equally useless program:
#define LIFE 42
main() {
/* The meaning of life, the universe and everything*/
int x = LIFE;
// Answer is 42
}
Here we used a preprocessor directive and a few comments in the code. Let's see what we get with:
cpp meaning_of_life.c > meaning_of_life.i
# 0 "meaning_of_life.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "meaning_of_life.c"
main() {
int x = 42;
}
Yup, apart from the file name, that's exactly the same output as in our previous example: the comments are gone and LIFE has been replaced by 42.
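One detail worth sketching: macro replacement works on tokens, so the name LIFE is left alone when it appears inside a string literal. The two functions below are hypothetical additions to the example, not part of the original program.

```c
#define LIFE 42

/* LIFE is a token here, so the preprocessor rewrites this to: return 42; */
int answer(void) {
    return LIFE;
}

/* Inside a string literal LIFE is just text, so nothing is replaced. */
const char *label(void) {
    return "LIFE";
}
```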
"Hello world"
So, let's try some code that actually does something. How about printing "Hello world" to our console?
#include <stdio.h>
int main(void) {
printf("Hello world\n");
return 0;
}
Then cpp hello.c > hello.i and...
Oh, wow! A 733-line file on my system, with over 50 linemarkers. What happened here? Well, we included the stdio.h header file, located in /usr/include/stdio.h on Linux systems. This is the standard library header for file input and output. It defines several macros, variable types and functions related to file input and output. The entire contents of stdio.h, along with those of every header it includes in turn, are pasted into the source file during the preprocessing stage.
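Since #include is only a textual paste, what the compiler really needs from stdio.h in our program is the declaration of the one function we call. A minimal sketch, assuming we are willing to write that declaration by hand (fine as an experiment, not something to do in real code); say_hello is a name made up for the example:

```c
/* No #include here: we hand-write the single declaration this code needs.
   It matches the prototype the C standard specifies for printf. */
int printf(const char *format, ...);

/* Prints the greeting and returns what printf returns:
   the number of characters written. */
int say_hello(void) {
    return printf("Hello world\n");
}
```

Running cpp on a file like this one should produce only a handful of lines, compared to the hundreds we got from the version with #include <stdio.h>.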
Compiler
The compiler will take our preprocessed file and convert it to assembly code.
So, let's try to convert our "Hello world" program to assembly. We already preprocessed our source file, so let's compile the hello.i file:
gcc -S hello.i
This writes the assembly code to hello.s:
.file "hello.c"
.text
.section .rodata
.LC0:
.string "Hello world"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $.LC0, %edi
call puts
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (GNU) 11.2.0"
.section .note.GNU-stack,"",@progbits
Your output might differ: assembly output is architecture-specific, and I use x86-64 on Linux.
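One thing stands out in this listing: the source calls printf, yet the assembly says call puts, and the string in .rodata has lost its trailing \n. GCC typically rewrites a printf call whose format is a plain string ending in a newline into the simpler puts, which appends the newline itself. A sketch of the two forms side by side (the function names are hypothetical):

```c
#include <stdio.h>

/* What we wrote: a format string with no conversions, ending in '\n'. */
int hello_as_written(void) {
    return printf("Hello world\n");
}

/* What the compiler emitted, in effect: puts adds the newline itself. */
int hello_as_compiled(void) {
    return puts("Hello world");
}
```

Both functions print the same bytes, but their return values differ (printf reports the number of characters written, puts only a non-negative value on success), so GCC only makes the swap when the result of printf is not used, as in our hello.c.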
It's both beyond the scope of the article and (way) beyond my knowledge level to explain the assembly output. Just some brief notes:
- Labels: ex., .LC0, .LFB0, .LFE0: Labels are names we assign to different locations of the program, ex., the start of a subroutine or function. By GCC convention, local labels begin with .L and end with a number.
  - .LC0: labels used for constants or strings in the read-only section of the data (.rodata) - .LC0, .LC1, etc.
  - .LFB0: labels used to denote the beginning of a function - .LFB0, .LFB1, .LFB2, etc.
  - .LFE0: labels used to denote the end of a function - .LFE0, .LFE1, etc.
- Registers: Registers are temporary storage locations in the CPU that keep data and instructions for immediate processing. There are several types of registers. Some that we see here:
  - %rbp - Base pointer: points to the base of the current stack frame, commonly used to access function parameters and local variables.
  - %rsp - Stack pointer: points to the top of the stack. As functions are called and return, it changes, keeping track of the current position in the stack. (stack: a region of memory used for storing function call information, local variables, etc.).
  - %eax - Accumulator: used for arithmetic and logical operations, as well as for storing function return values. In the x86-64 architecture, %eax is the lower 32 bits of %rax, the full 64-bit version of the register.
  - %edi - Destination index: commonly used for operations involving data movement or memory access. In our listing it carries the address of the string to puts.
Without wanting to divert too much from the topic at hand, I want to note here that the %rsp, %eax, etc. syntax is AT&T-specific, while Intel syntax uses simply rsp, eax, etc.
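To connect these registers back to C: under the calling convention used on x86-64 Linux (the System V ABI), the first integer argument of a function arrives in %edi and an int result goes back in %eax, which matches the movl $.LC0, %edi and movl $0, %eax lines in our listing. A tiny sketch with a made-up function name; the register comments are an assumption that only holds for that platform and convention:

```c
/* On x86-64 Linux, `value` is passed in %edi (first integer argument). */
int add_one(int value) {
    /* The result is handed back to the caller in %eax. */
    return value + 1;
}
```

Compiling this with gcc -S, as we did for hello.i, should show both registers in the generated code.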
Assembler
The computer does not understand assembly directly. Assembly is but an intermediate code between the high-level language (C, in this case) and the machine code.
So, let's now assemble our hello.s file to something that our computer might be able to understand:
as -o hello.o hello.s
And if we try to read the resulting object file, hello.o, we will get something like:
^?ELF^B^A^A^@^@^@^@^@^@^@^@^A^@>^@^A^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@H^B^@^@^@^@^@^@^@^@^@^@@^@^@^@^@@
^@^N^@^M^@UH<89>å¿^@^@^@è^@^@^@,^@^@^@^@]ÃHello world^@^@CC:
(GNU) 11.2.0^@^@^@^@^@^@^D^@^@^@
Hmmm, so that's the kind of stuff our computer actually understands? That doesn't even look like binary!
Actually, this output is binary. It just gets mangled when we open it with a text editor, as many of the bytes in the file are not printable ASCII characters.
However, this output is still not raw machine code that the CPU can execute. This is a binary file in the ELF format - Executable and Linkable Format - that contains metadata information and headers for the operating system's loader to interpret.
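The ^?ELF at the very start of the dump above is the format announcing itself: ^? is how the editor shows the byte 0x7f, and it is followed by the letters E, L and F. A minimal sketch of checking for this magic number; the function names are invented here, and file_is_elf would be called with a path such as hello.o.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* An ELF file starts with these four bytes: 0x7f, then the letters E, L, F. */
static const unsigned char ELF_MAGIC[4] = { 0x7f, 'E', 'L', 'F' };

/* Returns 1 if the buffer begins with the ELF magic number, 0 otherwise. */
int has_elf_magic(const unsigned char *bytes, size_t length) {
    return length >= sizeof ELF_MAGIC
        && memcmp(bytes, ELF_MAGIC, sizeof ELF_MAGIC) == 0;
}

/* Reads the first four bytes of a file and checks them.
   Returns 1 for ELF, 0 for anything else, -1 if the file cannot be opened. */
int file_is_elf(const char *path) {
    unsigned char header[4];
    FILE *file = fopen(path, "rb");
    if (file == NULL) {
        return -1;
    }
    size_t bytes_read = fread(header, 1, sizeof header, file);
    fclose(file);
    return has_elf_magic(header, bytes_read);
}
```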
Let's poke at it a bit further, using objdump, a program that displays information about object files. Its -s option dumps the full contents of every non-empty section:
objdump hello.o -s
Contents of section .text:
0000 554889e5 bf000000 00e80000 0000b800 UH..............
0010 0000005d c3 ...].
Contents of section .rodata:
0000 48656c6c 6f20776f 726c6400 Hello world.
Contents of section .comment:
0000 00474343 3a202847 4e552920 31312e32 .GCC: (GNU) 11.2
0010 2e3000 .0.
Contents of section .note.gnu.property:
0000 04000000 20000000 05000000 474e5500 .... .......GNU.
0010 020001c0 04000000 00000000 00000000 ................
0020 010001c0 04000000 01000000 00000000 ................
Contents of section .eh_frame:
0000 14000000 00000000 017a5200 01781001 .........zR..x..
0010 1b0c0708 90010000 1c000000 1c000000 ................
0020 00000000 15000000 00410e10 8602430d .........A....C.
0030 06500c07 08000000 .P......
Here, each line of the output is divided into 3 parts:
- The address (the offset within the section), ex., 0000, 0010, 0020
- The hexadecimal representation of the actual binary data stored in the object file, ex., 554889e5 bf000000 00e80000 0000b800
- The ASCII representation of the hexadecimal data. For example, the mysterious UH in the beginning corresponds to the hexadecimal bytes 55 48
The "mysterious" UH...
If you are wondering why the machine instructions look so "hesitant" to run our code, don't worry. It is (hopefully) not a critique of us personally, of the code, or something like "Uhhhh, you really waste my and the planet's resources for this?"
Assembly instructions, as we saw above, are in human-readable form. Machine code, however, does not care about readability. What objdump shows us is a hexadecimal representation of the binary instructions.
As we can verify with any hex-to-ASCII converter, the hex numbers 55 48 correspond to the text UH.
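The text column of the dump can be approximated in a few lines of C. This is only a sketch of the idea, not objdump's actual code, and ascii_column is a hypothetical name: printable ASCII bytes are shown as themselves and everything else as a dot.

```c
#include <ctype.h>
#include <stddef.h>

/* Fills `out` with the text column for `count` bytes: printable ASCII
   characters are copied as they are, everything else becomes a dot.
   `out` must have room for count + 1 characters. */
void ascii_column(const unsigned char *bytes, size_t count, char *out) {
    for (size_t i = 0; i < count; i++) {
        out[i] = isprint(bytes[i]) ? (char)bytes[i] : '.';
    }
    out[count] = '\0';
}
```

Fed the first bytes of our .text section (55 48 89 e5 bf 00), this yields UH followed by four dots, just like the start of the dump.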
However, 0x55 is also the encoding of the pushq %rbp instruction, while 0x48 0x89 0xe5 is movq %rsp, %rbp: the two instructions that set up the new activation frame for a function. How do I know this? I could be searching and searching for manuals on my x86-64 architecture, or my GAS assembler, but it's so much simpler to disassemble my object file with objdump -d hello.o:
hello.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: bf 00 00 00 00 mov $0x0,%edi
9: e8 00 00 00 00 call e <main+0xe>
e: b8 00 00 00 00 mov $0x0,%eax
13: 5d pop %rbp
14: c3 ret
Human-readable information of the ELF file
While objdump can display a lot of information about object files, if we really want to figure out what is going on with our ELF file, how it is structured and what sections, headers and symbols it includes, we have to use readelf:
readelf -a hello.o > hello.elf.txt
I will not paste the output of readelf here, but, if you are curious, you can try the command above and examine the output. However, let's briefly talk about how the operating system's loader handles ELF files:
- Loads sections: The loader loads various sections of the ELF file (code, data) into memory.
- Resolves symbols: If the program uses external symbols (ex., functions from shared libraries), the loader resolves these symbols.
- Sets the memory up: The loader sets up the stack and heap areas in memory.
- Initialises registers: It may initialise certain registers, including the program counter, to the appropriate starting point.
- Sets entry point: It transfers control to the program's entry point, typically the _start label in the case of assembly code.
Linker
The linker combines the object code we generated in the previous step with any required supporting code to create an executable program.
Here is the command we will use:
ld -o hello -dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/lib64/crt1.o /usr/lib64/crti.o hello.o -lc /usr/lib64/crtn.o
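This will produce an executable named hello, that we can run with ./hello. The paths of the crt*.o files and of the dynamic linker are the ones on my system and may differ on yours.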
So, what exactly happens during the linking stage? The linking stage involves merging all of our object files (one per source file, if we have more than one), the C or C++ runtime libraries and the system library, and recording which dynamic linker to use:
- The dynamic linker: The dynamic linker, or dynamic loader, is an essential operating system component. It loads shared libraries into a running program at runtime, resolving symbols and addresses that reference these libraries.
- C/C++ runtime libraries: The objects crt1.o, crti.o and crtn.o are the C runtime (CRT) startup objects, provided by the system's C library (libc, in my case). They contain the entry point (_start, in crt1.o) and the opening and closing parts of the initialisation and finalisation routines (_init and _fini, in crti.o and crtn.o), without which the program would not be able to start or terminate successfully.
- The system library: System libraries, such as libc, are crucial for the functioning of both the operating system and user programs. During the operating system's boot, essential libraries are loaded into memory. When running a program, the dynamic linker loads shared libraries like libc into the program's memory space. Explicitly linking the program to libc ensures access to its functions.
If there are multiple source files, each one is compiled to its own object file, and the linker combines these to create the final executable, resolving symbols across the different files:
┌─────────────┐ ┌─────────┐
│source file 1├─►│object 1 ├──────────────┐
├─────────────┤ ├─────────┤ │
├─────────────┤ ├─────────┤ │
│source file 2├─►│object 2 ├───────────┐ │
├─────────────┤ ├─────────┤ ▼ ▼
├─────────────┤ ├─────────┤ ┌──────┐ ┌──────────┐
│source file 3├─►│object 3 ├─────────►│Linker│──►│executable│
└─────────────┘ ├─────────┤ └──────┘ └──────────┘
├─────────┴─────────┐ ▲ ▲
│C runtime libraries├─┘ │
├──────────────┬────┘ │
├──────────────┤ │
│system library├─────────┘
└──────────────┘
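To make "resolving symbols across different files" a bit more concrete, here is a sketch of two hypothetical source files, square.c and area.c, shown in one listing for brevity. The file and function names are made up for the illustration.

```c
/* ---- square.c (hypothetical) ---- */
/* The definition: the object file square.o provides the symbol square. */
int square(int n) {
    return n * n;
}

/* ---- area.c (hypothetical) ---- */
/* Only a declaration: the compiler trusts that some other object file
   provides the function and leaves an unresolved symbol in area.o. */
int square(int n);

int area_of_square_room(int side) {
    return square(side);   /* the linker connects this call to square.o */
}
```

Split into the two files and compiled with gcc -c, each produces its own object file; running nm area.o should list square as undefined (U) until the linker matches it with the definition in square.o.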
Final executable
So, with this process, we got our final executable, hello, that will print the words "Hello world" when we run it.
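We could have gotten all the intermediate files from gcc in one command:
gcc -o hello hello.c -save-temps
Under the hood, the gcc driver goes through the same stages. Recent versions preprocess and compile in a single internal program (cc1) rather than calling cpp separately, and then invoke as and ld.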
The tools as and ld are part of GNU Binutils, while cpp is a part of GCC itself, together with gcc, the command that compiles C files. All together, these tools are part of the larger GNU toolchain, which includes tools like GNU make, GNU C Library (glibc), GNU Binutils, GNU Bison, GNU m4, GNU Debugger (GDB), and GNU Autotools (GNU Build System).