Writing a simple Compiler on my own - Machine Code Generation Principles
6711 months agoBusy9 min read
All the Code of the series can be found at the Github repository:
IntroductionHello it's a me again @drifter1! Today we continue with my Compiler Series, a series where we implement a complete compiler for a simple C-like language by using the C-tools Flex and Bison. In this article we will get into the final phase of Compiler Design, which is Machine Code Generation, skipping Intermediate Code (AST) Optimizations for now...
The topics that we will cover today are:
- What Machine Code Generation is all about
- Machine Code Generation "Tasks"
- What we have to implement for our Simple Compiler
Requirements:Actually you need to read and understand all the topics that I covered in the series as a whole, as these articles will give you access to knowledge about:
- What Compiler Design is (mainly the steps)
- For which exact Language the Compiler is build for (Tokens and Grammar)
- How to use Flex and Bison
- How to implement a lexer and parser for the language using those tools
- What the Symbol Table is and how we implement it
- How we combine Flex and Bison together
- How we can pass information from the Lexer to the Parser
- How we define operator priorities, precedencies and associativity
- What Semantic Analysis is (Attributes, SDT etc.)
- How we do the so called "Scope Resolution"
- How we declare types and check the type inter-compatibility for different cases that include function parameters and assignments
- How we check function calls later on using a "Revisit queue", as function declarations mostly happen after functions get used (in our Language)
- Intermediate Code Representations, including AST's
- How we implement an AST (structure and management)
- Action Rules for AST nodes and more
Difficulty:Talking about the series in general this series can be rated:
- Intermediate to Advanced
Actual Tutorial Content
What Machine Code Generation is all aboutBy definition, code generation is the mechanism of the compiler that takes the input source code and converts it into machine code. This machine code can then be executed by a specific CPU. To simplify the steps that are needed for this generation, we are first generating machine-independent intermediate code in form of an Abstract Syntax Tree (AST). This approach gives us lots of benefits:
- Easier generation of abstract Code
- Simpler re-targeting to different machine specifications
- Source-code Optimizations can be applied easier on the AST
So, using the intermediate code (AST) that has been created in the previous compilation steps, we basically make our compiler machine-specific at the final step of machine code generation. All the previous steps are only language-specific, even though lots of languages are quite similar in lots of statements, making even a language-change feasible during the compiler creation. The final output of the compiler will be managed by this final machine code generation procedure.
The steps of compiler design are basically summarized in the diagram that I put in front of all the posts:
I hope that this diagram helps you understand where Machine Code Generation fits in. Also note that we will skip optimization for now. Either way, what exactly needs to be done during this step? Which are the tasks that have to be performed in order to achieve the final result of Assembly Code? Well, that's what we will be talking about in this next section...
Machine Code Generation "Tasks"In order to get the final machine code or assembly code for a specific system (CPU), we have to:
- Learn about our Target System and its Language
- Convert/Map the Intermediate Representation into the target machine's instruction set (Instruction selection)
- Allocate the registers in which values will be kept during the execution (Register allocation)
- Manage the variables of the program, storing them in registers and/or memory addresses
- Order the execution of the instructions (Instruction scheduling)
Let's dive more into each of those tasks...
Learn about the TargetGetting to know the target system's machine code is quite important, as it helps us understand machine-specific instructions. Using this knowledge we can generate the final code in a more convenient way. Let's also not forget that the processor architecture: CISC or RISC, is also very important as it changes the generation procedure a lot. In general, RISC processors have a smaller number of simple instructions that are executed in one cycle, whilst CISC have a larger number of complex instructions that need more than one cycles to complete.
Instruction SelectionDuring the compilation procedure the input source code is converted to an intermediate representation that mostly is an Abstract Syntax Tree (AST). Learning about the target machine in the previous step, we basically also got to know the instruction set of that machine. So, now the nodes of the AST or the intermediate code in general has to be mapped into those instructions. Of course each representation can be mapped in many ways and using quite a few different instructions in different combinations etc. Thus, it's our responsibility to make the code generator choose the most appropriate instruction for each case.
Register and Memory Address AllocationThe in-between results and various variables of the program have to be stored inside of the registers or memory addresses that are available in the target machine's architecture. It's wise to give each register and memory address a descriptor, which will track the availability of both. To speed-up and optimize the program we want to allocate as many registers as possible, by also don't having to store into and load from the memory that often. In the end, the allocation of the registers over time is basically the most important speed-factor of the compiler. Memory addresses have to be allocated for variables/values that need to be stored for safe-keeping, like arrays/vectors which mostly can't fit into registers.
Instruction SchedulingThe instructions are mostly executed in the order that they occur in the source code, but scheduling the execution-order of them can easily optimize our compiler, as we might get rid of some register-to-memory "moves". But, such optimizations should be made carefully as we don't want to change the behavior of the code.
What we have to implement for our Simple CompilerSo, after talking about all these procedures what do you think is needed for our language? Well, we of course have to:
- Learn about the MIPS Processor and its Instruction Set
- Create functions that generate/output the code for the various nodes of the AST
- Allocate the Registers and Memory Addresses for the variables and in-between results
You are quite lucky about the first one, as I already made a complete tutorial series about MIPS Assembly that you can find under "Assembly" in my recap. Either way, more compiler-specific articles about MIPS Assembly will still come! Those will include an overview of the MIPS Instruction Set and the examples translated in MIPS Assembly.
For the rest, well, we will have to study each AST node (declaration, statement, assignment etc.) on its own in separate articles, to create a function or sub-case for each specific case. We will even have to get into the sub-types of each nodetype! For example, to understand this last bit better, simple variables need to be declared quite differently to arrays/vectors, meaning that each of them will need a different function to generate machine-code, even though both are initializations. The changes might not be dramatic in lots of these cases, but they still have to be covered! So, let's get ready to rock ;)
Images:All of the images are custom-made!
Previous parts of the series
General Knowledge and Lexical Analysis
- A simple C Language
- Lexical Analysis using Flex
- Symbol Table (basic structure)
- Using Symbol Table in the Lexer
Semantic Analysis (1)
Intermediate Code Generation (AST)
Semantic Analysis (2)
Final words | Next up on the projectAnd this is actually it for today's post! I hope that I explained everything as much as I needed to, meaning that you learned something out of it.
Next up on this series are:
- Machine Code generation (MIPS Assembly)
- Various Optimizations in the Compiler's Code
Which are all topics for which we will need more than one article to complete them. Also, note that we might also get into AST Code Optimizations later on, or that we could even extend the Language by adding complex datatypes (structs and unions), more rules etc.
So, see ya next time!
Keep on drifting! ;)