An Introduction to LLVM MC

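This article gives a brief introduction to LLVM MC. The author's knowledge is limited, so the text may contain misunderstandings; corrections and criticism are welcome.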

兴智开发者社区小助理 · Published 2023-07-20 15:53:08

1. Overview

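The LLVM MC (machine code) layer sits at the bottom of LLVM. Its main job is assembly, disassembly, and the generation of binary (object) files. It is an LLVM subproject, and tools such as llvm-mc and llvm-objdump operate on the MC layer directly. At the core of LLVM MC is a new class, 'MCInst', which represents an instruction together with its operands; it is distinct from 'MachineInstr', the code generator's existing notion of an instruction. The overall LLVM MC framework is shown in the figure below: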

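Depending on the input, there are two main paths:

When the input is an assembly file, the Assembly Parser parses each instruction into operands, MCTargetAsmParser converts those operands into the corresponding MCInst, and the MCInst is passed to MCStreamer, where the Instruction Encoder generates the binary file.
When the input is a binary file, the Instruction Decoder disassembles the binary code into MCInst objects, which are passed to MCStreamer to emit an assembly file as needed.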
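
The two paths can be illustrated with a small standalone round trip. This is only a sketch in plain C++ and uses no LLVM API: ToyInst stands in for MCInst, and the mnemonics, opcode values, "$n" register syntax, and bit layout are all hypothetical choices made for the example.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for llvm::MCInst: an opcode plus three register operands.
// Mnemonics, opcodes and bit layout are hypothetical.
struct ToyInst {
  uint8_t Opcode = 0;
  uint8_t Regs[3] = {0, 0, 0};
};

// Assembly path, step 1: parse text such as "add $1, $2, $3" into a ToyInst.
bool parseLine(const std::string &Line, ToyInst &Out) {
  std::istringstream In(Line);
  std::string Mnemonic;
  if (!(In >> Mnemonic))
    return false;
  if (Mnemonic == "add")
    Out.Opcode = 0x13;
  else if (Mnemonic == "sub")
    Out.Opcode = 0x14;
  else
    return false;

  std::string Tok;
  for (int I = 0; I < 3; ++I) {
    if (!(In >> Tok))
      return false;
    if (Tok.back() == ',')
      Tok.pop_back();
    // Accept "$0" .. "$15" only.
    if (Tok.size() < 2 || Tok.size() > 3 || Tok[0] != '$')
      return false;
    unsigned Reg = 0;
    for (std::size_t J = 1; J < Tok.size(); ++J) {
      if (!std::isdigit(static_cast<unsigned char>(Tok[J])))
        return false;
      Reg = Reg * 10 + static_cast<unsigned>(Tok[J] - '0');
    }
    if (Reg > 15)
      return false;
    Out.Regs[I] = static_cast<uint8_t>(Reg);
  }
  return !(In >> Tok); // reject trailing tokens
}

// Assembly path, step 2: encode as opcode(8) | ra(4) | rb(4) | rc(4) | 0(12),
// emitted most significant byte first.
std::vector<uint8_t> encode(const ToyInst &I) {
  uint32_t Word = (uint32_t(I.Opcode) << 24) | (uint32_t(I.Regs[0]) << 20) |
                  (uint32_t(I.Regs[1]) << 16) | (uint32_t(I.Regs[2]) << 12);
  return {uint8_t(Word >> 24), uint8_t(Word >> 16), uint8_t(Word >> 8),
          uint8_t(Word)};
}

// Disassembly path, step 1: rebuild a ToyInst from four bytes.
bool decode(const std::vector<uint8_t> &Bytes, ToyInst &Out) {
  if (Bytes.size() < 4 || (Bytes[0] != 0x13 && Bytes[0] != 0x14))
    return false;
  Out.Opcode = Bytes[0];
  Out.Regs[0] = static_cast<uint8_t>(Bytes[1] >> 4);
  Out.Regs[1] = static_cast<uint8_t>(Bytes[1] & 0x0f);
  Out.Regs[2] = static_cast<uint8_t>(Bytes[2] >> 4);
  return true;
}

// Disassembly path, step 2: print the instruction back as text.
std::string printInst(const ToyInst &I) {
  std::ostringstream OS;
  OS << (I.Opcode == 0x13 ? "add" : "sub");
  for (int K = 0; K < 3; ++K)
    OS << (K == 0 ? " $" : ", $") << unsigned(I.Regs[K]);
  return OS.str();
}
```

Calling parseLine and then encode follows the first path; passing the resulting bytes to decode and then printInst follows the second. In the real MC layer the common intermediate form is MCInst, and MCStreamer decides whether text or bytes come out.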

The main components of LLVM MC are:

Instruction Encoder
Instruction Decoder
Assembly Parser

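The Instruction Encoder provides the encoding interface used for assembly, and the Instruction Decoder provides the disassembly interface; each backend (target) implements the concrete encoding and decoding logic to suit its own needs. The Assembly Parser parses the instructions in an assembly file.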

The rest of this article uses the Cpu0 LLVM backend as an example to introduce these three components.

2. Components of LLVM MC

(1) Assembly Parser

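The Cpu0 backend's Assembly Parser parses the instructions in an assembly file into LLVM MCInst objects. It consists of a single source file, 'Cpu0AsmParser.cpp', which contains the 'Cpu0AsmParser' class derived from 'MCTargetAsmParser':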

public:
  Cpu0AsmParser(const MCSubtargetInfo &sti, MCAsmParser &parser,
                const MCInstrInfo &MII, const MCTargetOptions &Options)
    : MCTargetAsmParser(Options, sti, MII), Parser(parser) {
    setAvailableFeatures(ComputeAvailableFeatures(getSTI().getFeatureBits()));
  }

  MCAsmParser &getParser() const { return Parser; }
  MCAsmLexer &getLexer() const { return Parser.getLexer(); }

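The class declaration also contains the following two lines, which pull in the assembler-related functions that TableGen generates from the instruction definitions: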

#define GET_ASSEMBLER_HEADER
#include "Cpu0GenAsmMatcher.inc"

'Cpu0AsmParser.cpp' also defines the Cpu0Operand class, which holds the kind and contents of each operand of the machine instruction being parsed.

class Cpu0Operand : public MCParsedAsmOperand {
  enum KindTy {
    k_Immediate,
    k_Memory,
    k_Register,
    k_Token
  } Kind;
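
To make the role of the Kind tag concrete, here is a minimal standalone sketch of classifying operand text into the same four kinds. It is not the Cpu0Operand implementation: ToyOperand, classify, the "$n" register syntax, and the "offset($n)" memory syntax are assumptions made for illustration, and no LLVM headers are used.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Toy counterpart of an MCParsedAsmOperand subclass. Which fields are
// meaningful depends on Kind.
struct ToyOperand {
  enum KindTy { k_Immediate, k_Memory, k_Register, k_Token } Kind = k_Token;
  std::string Tok;     // always holds the raw text
  unsigned RegNum = 0; // k_Register: register; k_Memory: base register
  int64_t Imm = 0;     // k_Immediate: value; k_Memory: offset

  bool isImm() const { return Kind == k_Immediate; }
  bool isMem() const { return Kind == k_Memory; }
  bool isReg() const { return Kind == k_Register; }
  bool isToken() const { return Kind == k_Token; }
};

// Parses S[Begin, End) as an unsigned decimal number of at most 9 digits.
bool parseDigits(const std::string &S, std::size_t Begin, std::size_t End,
                 int64_t &Val) {
  if (Begin >= End || End - Begin > 9)
    return false;
  int64_t V = 0;
  for (std::size_t I = Begin; I < End; ++I) {
    if (!std::isdigit(static_cast<unsigned char>(S[I])))
      return false;
    V = V * 10 + (S[I] - '0');
  }
  Val = V;
  return true;
}

// Parses an optionally negative decimal immediate.
bool parseImm(const std::string &S, std::size_t Begin, std::size_t End,
              int64_t &Val) {
  bool Neg = Begin < End && S[Begin] == '-';
  if (!parseDigits(S, Begin + (Neg ? 1 : 0), End, Val))
    return false;
  if (Neg)
    Val = -Val;
  return true;
}

// Parses a register written as "$0" .. "$15" (hypothetical syntax).
bool parseReg(const std::string &S, std::size_t Begin, std::size_t End,
              unsigned &Reg) {
  int64_t N = 0;
  if (Begin >= End || S[Begin] != '$' || !parseDigits(S, Begin + 1, End, N) ||
      N > 15)
    return false;
  Reg = static_cast<unsigned>(N);
  return true;
}

// Classifies one operand string; anything unrecognized stays a token.
ToyOperand classify(const std::string &S) {
  ToyOperand Op;
  Op.Tok = S;
  unsigned Reg = 0;
  int64_t Imm = 0;
  std::size_t Open = S.find('(');
  if (parseReg(S, 0, S.size(), Reg)) {
    Op.Kind = ToyOperand::k_Register;
    Op.RegNum = Reg;
  } else if (parseImm(S, 0, S.size(), Imm)) {
    Op.Kind = ToyOperand::k_Immediate;
    Op.Imm = Imm;
  } else if (Open != std::string::npos && S.back() == ')' &&
             parseImm(S, 0, Open, Imm) &&
             parseReg(S, Open + 1, S.size() - 1, Reg)) {
    Op.Kind = ToyOperand::k_Memory;
    Op.Imm = Imm;
    Op.RegNum = Reg;
  }
  return Op;
}
```

With this sketch, classify("8($2)") produces a memory operand with offset 8 and base register 2, while a mnemonic such as "add" stays a token. The predicates mirror the way a matcher asks each parsed operand what kind it is before selecting an instruction.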

(2) Instruction Encoder

The Cpu0 backend's Instruction Encoder encodes an LLVM MCInst into binary code through the Cpu0MCCodeEmitter class. The implementation is in 'Cpu0MCCodeEmitter.cpp':

void Cpu0MCCodeEmitter::encodeInstruction(const MCInst &MI, raw_ostream &OS,
                                          SmallVectorImpl<MCFixup> &Fixups,
                                          const MCSubtargetInfo &STI) const {
  uint64_t Binary = getBinaryCodeForInstr(MI, Fixups, STI);

  // Check for unimplemented opcodes.
  unsigned Opcode = MI.getOpcode();
  if ((Opcode != Cpu0::NOP) && !Binary)
    llvm_unreachable("unimplemented opcode in encodeInstruction()");

  const MCInstrDesc &Desc = MCII.get(MI.getOpcode());
  uint64_t TSFlags = Desc.TSFlags;

  // Pseudo instructions don't get encoded
  // and shouldn't be here in the first place.
  if ((TSFlags & Cpu0II::FrmMask) == Cpu0II::Pseudo)
    llvm_unreachable("Pseudo opcode found in encodeInstruction()");

  // For now, every instruction is 4 or 8 bytes.
  int Size = Desc.getSize(); // FIXME: Have Desc.getSize() return the correct value

  EmitInstruction(Binary, Size, OS);
}
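
The body of EmitInstruction is not shown above. As a rough sketch of what such a helper has to do, the standalone function below writes the low Size bytes of an encoded value into a byte buffer in either byte order. The name emitBytes, the IsLittleEndian parameter, and the use of std::vector instead of raw_ostream are assumptions for illustration, not the Cpu0 source.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends the low Size bytes of Binary to Out.
// Big-endian order emits the most significant byte first; little-endian
// order emits the least significant byte first.
void emitBytes(uint64_t Binary, unsigned Size, bool IsLittleEndian,
               std::vector<uint8_t> &Out) {
  assert(Size <= 8 && "an encoded instruction must fit in 64 bits");
  for (unsigned I = 0; I < Size; ++I) {
    unsigned Shift = IsLittleEndian ? I * 8 : (Size - 1 - I) * 8;
    Out.push_back(static_cast<uint8_t>((Binary >> Shift) & 0xff));
  }
}
```

For a 4-byte value 0x13123000, big-endian order yields the bytes 13 12 30 00 and little-endian order yields 00 30 12 13. The decoder has to read bytes back in the same order the encoder wrote them.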

(3) Instruction Decoder

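The Cpu0 backend's Instruction Decoder turns an abstract sequence of bytes into an 'MCInst' and a 'Size'. It consists of a single source file, 'Cpu0Disassembler.cpp', which extends the 'MCDisassembler' class and is centered on the 'getInstruction' function. This function decodes the byte sequence and stores the result in the supplied 'MCInst':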

DecodeStatus Cpu0Disassembler::getInstruction(MCInst &Instr, uint64_t &Size,
                                              ArrayRef<uint8_t> Bytes,
                                              uint64_t Address,
                                              raw_ostream &VStream,
                                              raw_ostream &CStream) const {
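  // Start from Fail so that the early break for fewer than 4 bytes
  // does not return an uninitialized value.
  DecodeStatus Result = MCDisassembler::Fail;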
  const unsigned MaxInstBytesNum = (std::min)((size_t)8, Bytes.size());
  Bytes_ = Bytes.slice(0, MaxInstBytesNum);

  do {
    if (Bytes_.size() >= 8) {
      uint64_t Insn;
      Result = readInstruction64(Bytes_, Address, Size, Insn, IsBigEndian);
      if (Result == MCDisassembler::Fail)
        return Result;

      // Calling the auto-generated decoder function
      Result = decodeInstruction(DecoderTableCpu064, Instr, Insn, Address, this,
                                 STI);
      if (Result != MCDisassembler::Fail) {
        Size = 8;
        break;
      }
    }

    if (Bytes_.size() < 4)
      break;

    Bytes_ = Bytes_.slice(0, 4);
    uint32_t Insn;
    Result = readInstruction32(Bytes_, Address, Size, Insn, IsBigEndian);
    if (Result == MCDisassembler::Fail)
      return Result;
    // Calling the auto-generated decoder function
    Result = decodeInstruction(DecoderTableCpu032, Instr, Insn, Address, this, STI);
    if (Result != MCDisassembler::Fail) {
      Size = 4;
      break;
    }
  } while (false);
  return Result;
}
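
The control flow of getInstruction, which tries the 8-byte form first and falls back to the 4-byte form, can be sketched without LLVM as follows. The two matches... functions are hypothetical stand-ins for the TableGen-generated decoder tables, the rule that a first byte of 0xF0 or above marks a wide instruction is invented for the example, and big-endian input is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class DecodeStatus { Fail, Success };

// Reads the first N (at most 8) bytes of Bytes as a big-endian integer.
uint64_t readBE(const std::vector<uint8_t> &Bytes, std::size_t N) {
  assert(N <= 8 && N <= Bytes.size());
  uint64_t V = 0;
  for (std::size_t I = 0; I < N; ++I)
    V = (V << 8) | Bytes[I];
  return V;
}

// Hypothetical stand-ins for the generated 64-bit and 32-bit decoder tables.
bool matchesWideTable(uint64_t Insn) { return (Insn >> 56) >= 0xF0; }
bool matchesNarrowTable(uint32_t Insn) { return (Insn >> 24) < 0xF0; }

// Tries the 8-byte form first, then falls back to the 4-byte form.
// On success, Insn holds the raw instruction word and Size its length in bytes.
DecodeStatus decodeOne(const std::vector<uint8_t> &Bytes, uint64_t &Insn,
                       uint64_t &Size) {
  Size = 0;
  if (Bytes.size() >= 8) {
    uint64_t Wide = readBE(Bytes, 8);
    if (matchesWideTable(Wide)) {
      Insn = Wide;
      Size = 8;
      return DecodeStatus::Success;
    }
  }
  if (Bytes.size() < 4)
    return DecodeStatus::Fail;
  uint32_t Narrow = static_cast<uint32_t>(readBE(Bytes, 4));
  if (!matchesNarrowTable(Narrow))
    return DecodeStatus::Fail;
  Insn = Narrow;
  Size = 4;
  return DecodeStatus::Success;
}
```

The caller uses the returned Size to advance to the next instruction, which is why the function reports it separately from the decoded result.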

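'Cpu0Disassembler.cpp' also provides helper functions for decoding operands. Each one takes the MCInst being built, the encoded register number or immediate value, the address in memory at which the instruction is located, and a pointer to the decoder. TableGen determines the names of these functions when it builds the decoder tables; for register classes they follow the form 'Decode<RegClass>RegisterClass':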

static DecodeStatus DecodeGPROutRegisterClass(MCInst &Inst,
                                              unsigned RegNo,
                                              uint64_t Address,
                                              const void *Decoder) {
  return DecodeCPURegsRegisterClass(Inst, RegNo, Address, Decoder);
}

static DecodeStatus DecodeSRRegisterClass(MCInst &Inst,
                                          unsigned RegNo,
                                          uint64_t Address,
                                          const void *Decoder) {
  return DecodeCPURegsRegisterClass(Inst, RegNo, Address, Decoder);
}

static DecodeStatus DecodeSimm14(MCInst &Inst,
                                 unsigned Insn,
                                 uint64_t Address,
                                 const void *Decoder) {
  Inst.addOperand(MCOperand::createImm(SignExtend32<14>(Insn)));
  return MCDisassembler::Success;
}

static DecodeStatus DecodeSimm32(MCInst &Inst,
                                 unsigned Insn,
                                 uint64_t Address,
                                 const void *Decoder) {
  Inst.addOperand(MCOperand::createImm(SignExtend32<32>(Insn)));
  return MCDisassembler::Success;
}
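
DecodeSimm14 and DecodeSimm32 rely on sign extension: the field is stored in the instruction as B raw bits and has to become a signed value before it is added as an immediate operand. The function below is a standalone sketch with the same intent as llvm::SignExtend32; it is not LLVM's source, and the name signExtend32 and the XOR-and-subtract formulation are choices made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Interprets the low B bits of X as a two's-complement number and widens it
// to 32 bits. Bits above bit B-1 are ignored.
template <unsigned B> int32_t signExtend32(uint32_t X) {
  static_assert(B > 0 && B <= 32, "bit width out of range");
  const uint32_t Mask = 0xFFFFFFFFu >> (32 - B); // low B bits set
  const uint32_t Sign = 1u << (B - 1);           // the field's sign bit
  const uint32_t V = X & Mask;
  // Flipping the sign bit and subtracting it again maps the upper half of
  // the field's range onto negative values (unsigned arithmetic wraps).
  return static_cast<int32_t>((V ^ Sign) - Sign);
}
```

For a 14-bit field, the raw value 0x3FFF becomes -1 and 0x2000 becomes -8192, while 0x1FFF stays 8191. With B equal to 32 the function only reinterprets the bits, which matches DecodeSimm32 passing the value through unchanged.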

3. Summary

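By introducing the MCInst class, LLVM MC makes it possible to add an instruction description in one appropriate place and obtain an assembler, a disassembler, and compiler backend support from it. A new LLVM backend that wants MC-layer functionality needs to add a target-specific AsmParser and implement target-specific disassembly.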