Data

Crawler

Data is crawled from StackOverflow questions and answers to build the backend knowledge base.

import concurrent.futures
import csv
import threading

import requests

api_url = "https://api.stackexchange.com/2.3/search"

# Search parameters: keyword search for questions related to exceptions and errors
params = {
    "site": "stackoverflow",
    "pagesize": 100,          # number of questions per request
    "intitle": "exception",   # keyword to search for in question titles
    "key": "OcugRWcRkGc4BmksZoNdag((",  # replace with your Stack Exchange API key
}

questions = []
page = 1

# Create the CSV file and write the header row
with open(
    "stackoverflow_data_exception.csv",
    mode="w",
    newline="",
    encoding="utf-8",
) as file:
    writer = csv.writer(file)
    writer.writerow(["Question ID", "Question Title", "Answer ID", "Answer"])

# Guard concurrent appends to the CSV file from worker threads
csv_lock = threading.Lock()


def process_question(question):
    question_id = question["question_id"]
    title = question["title"]

    # Fetch the answers for this question
    answers_api_url = f"https://api.stackexchange.com/2.3/questions/{question_id}/answers"
    answers_params = {
        "site": "stackoverflow",
        "pagesize": 3,         # fetch the top 3 answers; adjust as needed
        "order": "desc",
        "sort": "votes",       # sort by vote count
        "filter": "withbody",  # include the answer bodies in the response
        "key": "OcugRWcRkGc4BmksZoNdag((",  # replace with your Stack Exchange API key
    }
    answers_response = requests.get(answers_api_url, params=answers_params)

    if answers_response.status_code == 200:
        answers = answers_response.json().get("items", [])

        # Only keep questions that have at least three answers
        if len(answers) >= 3:
            print(f"Question ID: {question_id}")
            print(f"Title: {title}")

            # Append the top three answers to the CSV file
            with csv_lock, open(
                "stackoverflow_data_exception.csv",
                mode="a",
                newline="",
                encoding="utf-8",
            ) as file:
                writer = csv.writer(file)
                for answer in answers[:3]:
                    answer_id = answer["answer_id"]
                    answer_body = answer["body"]
                    print(f"Answer ID: {answer_id}")
                    writer.writerow([question_id, title, answer_id, answer_body])
            print("\n" + "=" * 50 + "\n")
    else:
        print(f"Error getting answers for question {question_id}")


# Page through the search results until at least 5000 questions are collected
while len(questions) < 5000:
    params["page"] = page
    response = requests.get(api_url, params=params)

    if response.status_code == 200:
        items = response.json().get("items", [])
        if not items:
            break
        questions.extend(items)
        page += 1
    else:
        print(f"Error: {response.status_code}")
        break

# Process the questions in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    for question in questions:
        executor.submit(process_question, question)

Annotation

Since this project is a course assignment, and in the spirit of "it just has to run", the data annotation step is simplified away entirely: the raw text of the crawled data is used directly.

The graph database schema is designed as follows:

  • Vertex: Question{ title, tags, embedding }

  • Vertex: Answer{ content }

  • Edge: SolvedBy { }

Import

Since the dataset is small, the data is imported directly with Cypher statements.

Take one Question and its three corresponding Answers as an example:

CREATE (q:Question{title:"java error installing running elastic stack in Windows 10",tags:"java;elastic-stack"});

MATCH (q:Question{title:"java error installing running elastic stack in Windows 10",tags:"java;elastic-stack"})
CREATE
(a1:Answer{content:"You'r facing this problem due to wrong java folder location. Change or move your java folder to program files/Java and updated java path to ur system path and this will solve your problem .. !! This worked for me.. "}),
(a2:Answer{content:"First chack your JAVA_HOME, it shoud point to &quot;C:/Program Files/Java/jdk-15&quot; if you open service.bat you will see elasticsearch use %JAVA_HOME%/bin/java.exe, so your JAVA_HOME shoud not have /bin part. "}),
(a3:Answer{content:"Try using the docker ELK Stack instead installing everything manually - https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html And you can run this in a VirtualBox instance of Fedora or Ubuntu so that you don't break your windows environment. "}),
(q)-[:SolvedBy]->(a1),
(q)-[:SolvedBy]->(a2),
(q)-[:SolvedBy]->(a3);
The Question's title is then embedded with the OpenAI Embedding API and the vector is written back onto the node:
from openai.embeddings_utils import get_embedding

# Neo4jGraph is assumed to be a wrapper around the Neo4j driver exposing
# query(query, param_map=...); it is imported/defined elsewhere in the project.
g = Neo4jGraph()

title = "java error installing running elastic stack in Windows 10"
embedding = get_embedding(text=title, engine="text-embedding-ada-002")

# Write the embedding back onto the matching Question node
query = "MATCH (n:Question) WHERE n.title=$title SET n.embedding=$embedding"
g.query(query=query, param_map={"title": title, "embedding": embedding})

Backend

Web Framework

FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.8+, based on standard Python type hints.

Key features:

  • Fast: very high performance, on par with Node.js and Go (thanks to Starlette and Pydantic); one of the fastest Python web frameworks available
  • Fast to code: increases feature development speed by about 200% to 300%
  • Fewer bugs: reduces about 40% of human (developer) induced errors
  • Intuitive: great editor support, with completion everywhere; less time debugging
  • Easy: designed to be easy to use and learn; less time reading docs
  • Short: minimizes code duplication; multiple features from each parameter declaration; fewer bugs
  • Robust: production-ready code, with automatic interactive documentation
  • Standards-based: based on (and fully compatible with) the open standards for APIs: OpenAPI
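
For illustration, a minimal FastAPI sketch of how the backend's query endpoint could be declared (the /query path and the QueryRequest model are assumptions for this write-up, not the project's actual code):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# The request body is declared with type hints; FastAPI validates it automatically
class QueryRequest(BaseModel):
    question: str  # the user's natural-language question

@app.post("/query")
def query(req: QueryRequest):
    # A real handler would run the retrieval pipeline described under "Query Endpoint"
    return {"answer": "..."}

Run it with uvicorn, as in the Dockerfile's CMD under CI/CD below, and FastAPI serves the auto-generated interactive documentation at /docs with no extra code.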

Business Logic

Query Endpoint

Input: the user's question, in natural language

Processing flow:

  1. Call the OpenAI Embedding API to generate an embedding vector for the user's natural-language question
  2. Fetch the embedding vectors of all Question nodes and compute each one's cosine similarity with the question embedding
  3. Take the K most similar Question nodes and keep those whose similarity exceeds a threshold; these are the "relevant Question nodes"
  4. Fetch all Answer nodes attached to the relevant Question nodes
  5. Serialize the query results into a string and, using it as context together with the user's question, build the prompt
  6. Call the OpenAI Chat Completion API to generate the final answer

Output: the LLM's answer, grounded in the knowledge-graph query results. A minimal sketch of this pipeline is shown below.
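
A minimal sketch of the whole flow, assuming the legacy openai 0.x SDK (whose embeddings_utils module also provides cosine_similarity) and the same Neo4jGraph wrapper used in the import step; answer_query, k=3, the 0.8 threshold, and gpt-3.5-turbo are illustrative choices, not necessarily the project's actual values:

import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

g = Neo4jGraph()  # same wrapper as in the import step

def answer_query(user_question: str, k: int = 3, threshold: float = 0.8) -> str:
    # 1. Embed the user's question
    q_emb = get_embedding(text=user_question, engine="text-embedding-ada-002")

    # 2./3. Score every Question node and keep the top-k above the threshold
    rows = g.query("MATCH (q:Question) RETURN q.title AS title, q.embedding AS embedding")
    scored = sorted(
        ((cosine_similarity(q_emb, r["embedding"]), r["title"])
         for r in rows if r["embedding"]),
        reverse=True,
    )
    relevant = [title for sim, title in scored[:k] if sim >= threshold]

    # 4. Collect the Answer nodes of each relevant Question
    context = []
    for title in relevant:
        answers = g.query(
            "MATCH (q:Question {title: $title})-[:SolvedBy]->(a:Answer) "
            "RETURN a.content AS content",
            param_map={"title": title},
        )
        context.append(title + "\n" + "\n".join(a["content"] for a in answers))

    # 5. Serialize the query results into the prompt
    prompt = (
        "Answer the question using the StackOverflow context below.\n\n"
        + "\n\n".join(context)
        + f"\n\nQuestion: {user_question}"
    )

    # 6. Generate the final answer
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]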

CI/CD

  • Build the Docker image
FROM python:3.8

ENV OPENAI_API_BASE "https://api.openai-proxy.com/v1"

WORKDIR /code

COPY ./requirements.txt /code/requirements.txt

RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt -i https://mirrors.aliyun.com/pypi/simple/

COPY ./app /code/app

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "80"]
  • Use GitHub Actions for automated testing and deployment
$ mkdir -p .github/workflows
$ cd .github/workflows
$ touch deploy.yml
name: Docker Deploy to Cloud Server

on:
  push:
    branches:
      - main

jobs:
  unittest:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      OPENAI_API_BASE: "https://api.openai-proxy.com/v1"
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run tests
        run: python -m unittest discover

  deploy:
    needs: unittest
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Copy files to server
        uses: appleboy/scp-action@master
        with:
          host: ${{ secrets.SERVER_IP }}
          username: ${{ secrets.SERVER_USER }}
          password: ${{ secrets.SERVER_PASSWORD }}
          source: "."
          target: "/path/to/backend"
      - name: Build and Run Docker on server
        uses: appleboy/ssh-action@master
        with:
          host: ${{ secrets.SERVER_IP }}
          username: ${{ secrets.SERVER_USER }}
          password: ${{ secrets.SERVER_PASSWORD }}
          script: |
            cd /path/to/backend
            docker build -t asetp_backend .
            docker stop asetp_backend_container || true
            docker rm asetp_backend_container || true
            docker run -d --name asetp_backend_container -p 8000:80 -e OPENAI_API_KEY=${{ secrets.OPENAI_API_KEY }} asetp_backend
            docker image prune -f
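
Once the workflow has run, the deployment can be sanity-checked from any machine: the container maps host port 8000 to uvicorn's port 80, and FastAPI serves its interactive docs automatically, so (with the server address substituted for the placeholder) something like the following should respond:

$ curl http://<SERVER_IP>:8000/docs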

Frontend