Posts in the category 运维日寄 (Ops Diary)

https://help.aliyun.com/document_detail/412618.html?spm=5176.2020520152.help.19.740b16ddtBrgBC#section-elq-kxx-bzx

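Status | Meaning | Common causes
Pending | The pod cannot be scheduled onto a node | Taint/toleration problems; insufficient resources or dependencies; ports exhausted
ImagePullBackOff | Scheduled, but the image cannot be pulled | Incorrect image information; no network route to the image registry
CrashLoopBackOff | The application has a problem | Check the logs and inspect the application; check the container's health check configuration
Completed | The application has finished running | No long-running background program, or the program has finished executing
Terminating | The pod is being deleted | k8s only takes the old container down after the new one is up, so this is normal; if the pod stays in this state for a long time you can delete it manually; check the health check's delayseconds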
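
As a rough illustration of how the table can be used, the sketch below turns it into a lookup and applies it to the text printed by kubectl get pods. It assumes the default column order (NAME, READY, STATUS, ...); the names HINTS and triage and the sample pod names are made up for this example.

```python
# Hints transcribed from the status table above (assumption: wording shortened).
HINTS = {
    "Pending": ["taints/tolerations", "insufficient resources or dependencies", "ports exhausted"],
    "ImagePullBackOff": ["incorrect image information", "registry unreachable"],
    "CrashLoopBackOff": ["check application logs", "check health check configuration"],
    "Completed": ["normal if nothing is meant to keep running"],
    "Terminating": ["normal for a while; delete manually if it never finishes"],
}


def triage(kubectl_output):
    """Return (pod, status, hints) for each pod whose STATUS has an entry in HINTS."""
    results = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:              # skip the header row
        cols = line.split()
        if len(cols) < 3:
            continue
        name, status = cols[0], cols[2]  # STATUS is the third column by default
        if status in HINTS:
            results.append((name, status, HINTS[status]))
    return results


# Hypothetical output, not taken from a real cluster.
SAMPLE = """\
NAME     READY   STATUS             RESTARTS   AGE
web-1    0/1     CrashLoopBackOff   3          5m
db-0     1/1     Running            0          2d
"""

for name, status, hints in triage(SAMPLE):
    print(name, status, "->", "; ".join(hints))
```

Pods in states the table does not cover (Running, init states and so on) are simply skipped.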

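While listing pods I found one that was stuck in Terminating. I tried to remove it with delete, but the command just hung and the pod would not go away.
Following https://blog.csdn.net/lisongyue123/article/details/118966921, I went to edit its finalizers field, but when I actually tried, this pod turned out not to have that field at all. Looking at the pod details more carefully, I saw that it had originally been deployed on node01, and that node had been taken offline the day before because of an unpaid bill (sad). So failing to delete it makes perfect sense: with the machine gone, the kubelet on it died along with it.
kubectl get node confirms it, there is indeed a dead node: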

[root@master ~]# kubectl get node
NAME     STATUS     ROLES                  AGE   VERSION
master   Ready      control-plane,master   28d   v1.22.4
node01   NotReady   <none>                 28d   v1.22.4

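So I set out to remove the node. I first tried drain to evict it gracefully, but the ghost pod blocked it.
I switched to delete and the node was removed successfully, but a quick check showed the pod was still there. No matter, I have --force.
I went to force-delete it, and this time got an error saying the pod does not exist; checking again, it really was gone. I just don't know whether it was removed together with the node or by the force command. If there is a next time I can try both separately..... but ideally there won't be a next time.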
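
The diagnosis above (a pod marked for deletion whose node is dead or gone) can be sketched as a small check. This is only an illustration: it assumes the input dicts have the shape printed by kubectl get pods -A -o json and kubectl get nodes -o json, and the function names and sample objects are invented for the example.

```python
def node_is_ready(node):
    """True if the node reports a Ready condition with status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def stuck_pods(pods, nodes):
    """Pods with a deletionTimestamp whose node is NotReady or no longer exists."""
    ready = {n["metadata"]["name"] for n in nodes["items"] if node_is_ready(n)}
    stuck = []
    for pod in pods["items"]:
        meta = pod["metadata"]
        node_name = pod.get("spec", {}).get("nodeName")
        # kubectl shows Terminating once metadata.deletionTimestamp is set.
        if "deletionTimestamp" in meta and node_name not in ready:
            stuck.append((meta.get("namespace"), meta["name"], node_name))
    return stuck


# Hypothetical objects, trimmed to the fields the check reads.
nodes = {"items": [
    {"metadata": {"name": "master"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node01"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]}
pods = {"items": [
    {"metadata": {"namespace": "default", "name": "ghost",
                  "deletionTimestamp": "2022-01-01T00:00:00Z"},
     "spec": {"nodeName": "node01"}},
]}
print(stuck_pods(pods, nodes))
```

For anything this reports, the force delete mentioned above is kubectl delete pod NAME -n NAMESPACE --grace-period=0 --force; it removes the API object without waiting for the kubelet to confirm.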

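I got a task: write a small k8s-related tool, with the target cluster on Alibaba Cloud.
I pulled the library and wrote the code following the demo, and got an error:
Max retries exceeded with url: /api/v1/namespaces (Caused by SSLError(SSLError(397, '[SSL: CA_KEY_TOO_SMALL] ca key too small (_ssl.c:3862)')))
Ugh. This config was copied straight from the cluster on the official site and the local kubectl works with it, so I was sure the cause was not a wrong certificate.
I then spent ages trying to skip certificate verification. The client has api_client.configuration.verify_ssl, but, maybe because I was using it the wrong way, once I configured things through configuration, config stopped taking effect.
After more fiddling I came across someone else who had been bitten by Alibaba Cloud in the same way. The gist: it is down to the OpenSSL security level. The default level 2 requires a CA key of 2048 bits, the Alibaba Cloud one is 1024, hence CA_KEY_TOO_SMALL.
Someone further down the thread gave a more specific fix.
But if you paste it in directly you may find it has no effect; that is because the library version is wrong (the fix below pins urllib3 below 2).
Complete fix: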

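# Shell: urllib3 2.x removed DEFAULT_CIPHERS, so pin 1.x
pip install requests "urllib3<2"
# Python: apply this before the kubernetes client is created
import urllib3
urllib3.util.ssl_.DEFAULT_CIPHERS = "ALL:@SECLEVEL=1"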
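
To see what that cipher string does on its own, here is a minimal sketch using only the standard ssl module. It assumes Python is linked against OpenSSL 1.1.0 or newer, where @SECLEVEL=n is accepted inside a cipher string; make_context is a name made up for this example, and nothing here touches the kubernetes client.

```python
import ssl


def make_context(seclevel=1):
    """Client TLS context whose cipher string sets the OpenSSL security level."""
    ctx = ssl.create_default_context()
    # "ALL" selects the cipher suites; "@SECLEVEL=1" lowers the security level,
    # which is what lets 1024-bit keys through instead of CA_KEY_TOO_SMALL.
    ctx.set_ciphers(f"ALL:@SECLEVEL={seclevel}")
    return ctx


ctx = make_context(1)
print(ssl.OPENSSL_VERSION)
print(len(ctx.get_ciphers()), "cipher suites enabled")
# SSLContext.security_level only exists on Python 3.10+, hence the getattr.
print("security level:", getattr(ctx, "security_level", "n/a"))
```

Lowering the level applies to every connection made with that context, so it is best limited to the client that talks to this one cluster.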

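You can use strace -c cmd to trace the system calls made by a command, or strace -p pid to attach to a process that is already running. Their output differs somewhat: with -c you get a per-syscall summary table once the command exits.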

[root@master ~]# strace -c pwd
/root
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 21.71    0.000104          11         9           mmap
 16.91    0.000081          16         5           close
 10.86    0.000052          17         3         3 access
 10.02    0.000048          12         4           mprotect
  8.98    0.000043          14         3           open
  8.14    0.000039          19         2           munmap
  6.68    0.000032           8         4           fstat
  6.05    0.000029           7         4           brk
  4.38    0.000021          21         1           write
  2.71    0.000013          13         1           execve
  2.09    0.000010          10         1           arch_prctl
  0.84    0.000004           4         1           getcwd
  0.63    0.000003           3         1           read
------ ----------- ----------- --------- --------- ----------------
100.00    0.000479                    39         3 total

With -p, the amount of text printed may be rather more than you expect..

[root@master ~]# strace -p 1
strace: Process 1 attached
epoll_wait(4, [{EPOLLIN, {u32=1532991024, u64=93829388539440}}], 33, -1) = 1
clock_gettime(CLOCK_BOOTTIME, {tv_sec=319337, tv_nsec=129368879}) = 0
recvmsg(18, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="WATCHDOG=1", iov_len=4096}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=365, uid=0, gid=0}}], msg_controllen=32, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 10
open("/proc/365/cgroup", O_RDONLY|O_CLOEXEC) = 16
fstat(16, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32446e7000
read(16, "11:memory:/system.slice/systemd-"..., 1024) = 370
close(16)                               = 0
munmap(0x7f32446e7000, 4096)            = 0
timerfd_settime(3, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=319494, tv_nsec=691126000}}, NULL) = 0
epoll_wait(4, [{EPOLLIN, {u32=1532730112, u64=93829388278528}}], 33, -1) = 1
clock_gettime(CLOCK_BOOTTIME, {tv_sec=319337, tv_nsec=131285975}) = 0
recvmsg(38, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1 \0\0\0@\r\0\0\211\0\0\0\1\1o\0\25\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
recvmsg(38, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0"..., iov_len=168}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 168
epoll_wait(4, [{EPOLLIN, {u32=1532730112, u64=93829388278528}}], 33, -1) = 1
clock_gettime(CLOCK_BOOTTIME, {tv_sec=319337, tv_nsec=131777119}) = 0
recvmsg(38, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\1@\1\0\0~/\0\0\305\0\0\0\1\1o\0\31\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
recvmsg(38, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="/org/freedesktop/systemd1\0\0\0\0\0\0\0"..., iov_len=512}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 512
sendmsg(38, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\1\t\0\0\0\0E\1\0\207\0\0\0\1\1o\0\25\0\0\0/org/fre"..., iov_len=152}, {iov_base="\4\0\0\0:1.1\0", iov_len=9}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 161
recvmsg(38, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\2\1\1\4\0\0\0\311\237\0\0=\0\0\0\6\1s\0\4\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
recvmsg(38, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base=":1.0\0\0\0\0\5\1u\0\0E\1\0\10\1g\0\1u\0\0\7\1s\0\24\0\0\0"..., iov_len=60}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 60
getuid()                                = 0
stat("/run/systemd", {st_mode=S_IFDIR|0755, st_size=400, ...}) = 0
mkdir("/run/systemd/system", 0755)      = -1 EEXIST (File exists)
stat("/run/systemd/system", {st_mode=S_IFDIR|0755, st_size=1200, ...}) = 0
umask(077)                              = 000
open("/run/systemd/system/.#session-811.scopeHaHZNz", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0600) = 16
umask(000)                              = 077
fcntl(16, F_GETFL)                      = 0x8002 (flags O_RDWR|O_LARGEFILE)
umask(0777)                             = 000
fchmod(16, 0644)                        = 0
umask(000)                              = 0777
fstat(16, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32446e7000
write(16, "# Transient stub\n", 17)     = 17
rename("/run/systemd/system/.#session-811.scopeHaHZNz", "/run/systemd/system/session-811.scope") = 0
close(16)                               = 0
munmap(0x7f32446e7000, 4096)            = 0
stat("/run/systemd/system", {st_mode=S_IFDIR|0755, st_size=1220, ...}) = 0
mkdir("/run/systemd/system/session-811.scope.d", 0755) = 0
umask(077)                              = 000
open("/run/systemd/system/session-811.scope.d/.#50-Slice.confd5TEip", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0600) = 16
umask(000)                              = 077
fcntl(16, F_GETFL)                      = 0x8002 (flags O_RDWR|O_LARGEFILE)
umask(0777)                             = 000
fchmod(16, 0644)                        = 0
umask(000)                              = 0777
fstat(16, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32446e7000
write(16, "[Scope]\nSlice=user-0.slice\n", 27) = 27
rename("/run/systemd/system/session-811.scope.d/.#50-Slice.confd5TEip", "/run/systemd/system/session-811.scope.d/50-Slice.conf") = 0
close(16)                               = 0

If you want to save the output, use something like strace -o outputfilename pwd; it writes the trace log to that file in the current directory.
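
Once the trace is in a file it is easy to post-process. The sketch below counts calls per syscall in a saved log, roughly what -c reports in its calls column. The regular expression is an assumption about the common line shapes (an optional PID prefix as written with -f, an optional -t timestamp, then name and opening parenthesis); count_syscalls is a made-up name.

```python
import re
from collections import Counter

# Optional PID prefix, optional HH:MM:SS timestamp, then "syscall(".
CALL_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(?:\d{2}:\d{2}:\d{2}(?:\.\d+)?\s+)?(\w+)\("
)


def count_syscalls(lines):
    """Count how often each syscall name appears in strace output lines."""
    counts = Counter()
    for line in lines:
        match = CALL_RE.match(line)
        if match:                      # skips "+++ exited ..." and program output
            counts[match.group(1)] += 1
    return counts


# A few lines in the format strace prints.
SAMPLE = [
    '10:26:01 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'fstat(16, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0',
    'close(16)                               = 0',
    '10:26:01 +++ exited with 0 +++',
]
for name, calls in count_syscalls(SAMPLE).most_common():
    print(f"{calls:6d} {name}")
```

For a real log, pass the open file instead of the list: with open("outputfilename") as f: count_syscalls(f).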
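If you want to watch how a particular system call is invoked by a command as it runs, you can do it like this (using the open syscall and the ls command as the example; -t prefixes each line with the time, -e open restricts the trace to open):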

[root@master ~]# strace -t -e open ls
10:26:01 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
10:26:01 open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
10:26:01 open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
10:26:01 open("/lib64/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
10:26:01 open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
10:26:01 open("/lib64/libpcre.so.1", O_RDONLY|O_CLOEXEC) = 3
10:26:01 open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
10:26:01 open("/lib64/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
10:26:01 open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
10:26:01 open("/proc/filesystems", O_RDONLY) = 3
10:26:01 open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
flannel.yaml  jen-dep.yaml  jen-pv.yaml  jen-role.yaml  jen-svc.yaml  output
10:26:01 +++ exited with 0 +++

If you also want to trace the child processes spawned by the program you launch, use strace -f cmd.

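When a process exits it leaves behind an exit status. The parent collects that status with wait(); once it has, the system reclaims the process's PCB (process control block) and removes the process entry. If the parent never collects it, the process stays in the zombie state.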
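
The lifecycle described above can be reproduced in a few lines. This is a Linux-only sketch that reads the state letter from /proc/PID/stat; the helper names are made up for the example. The child exits immediately, the parent deliberately delays its wait, and the child should show up as Z until the parent collects the status.

```python
import os
import time


def parse_proc_stat(text):
    """Return (pid, comm, state, ppid) from the contents of /proc/<pid>/stat."""
    # comm sits in parentheses and may contain spaces or ')' itself,
    # so split on the LAST ')' instead of on whitespace.
    left, right = text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    fields = right.split()
    return int(pid_str), comm, fields[0], int(fields[1])


def read_state(pid):
    """State letter of a PID (R, S, T, Z, ...), or None if the entry is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return parse_proc_stat(f.read())[2]
    except (FileNotFoundError, ProcessLookupError):
        return None


def demo_zombie(timeout=5.0):
    """Fork a child that exits at once; return (state before wait, exit code, state after)."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)                    # child: exit immediately with status 7
    deadline = time.monotonic() + timeout
    state = read_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)               # the parent has not called wait() yet
        state = read_state(pid)
    _, status = os.waitpid(pid, 0)     # parent collects the exit status
    return state, os.WEXITSTATUS(status), read_state(pid)


if __name__ == "__main__":
    print(demo_zombie())
```

After waitpid returns, the /proc entry disappears, which is the reclaim step from the paragraph above.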

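How to handle it:

A zombie is a process that has already exited, so kill cannot kill it. Start by checking its parent: what has happened to it, and why is it not collecting the exit status? The command below prints a process's state (use the parent's PID as pid):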

ps -p pid -o stat | tail -n 1

If the parent is stopped (T), resuming it is enough:
kill -SIGCONT pid
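
The stopped state and the effect of SIGCONT can be observed with a throwaway child process. This is a Linux-only sketch: the sleeping child stands in for a stopped parent, and the function names are invented for the example.

```python
import os
import signal
import time


def proc_state(pid):
    """State letter from /proc/<pid>/stat (first field after the command name)."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rsplit(")", 1)[1].split()[0]


def wait_for_state(pid, predicate, timeout=5.0):
    """Poll until predicate(state) holds or the timeout expires; return the last state."""
    deadline = time.monotonic() + timeout
    state = proc_state(pid)
    while not predicate(state) and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)
    return state


def demo_stop_and_continue():
    """Stop a child, then resume it; return its state after each step."""
    pid = os.fork()
    if pid == 0:
        time.sleep(60)                 # child: stand-in for the stopped parent
        os._exit(0)
    try:
        os.kill(pid, signal.SIGSTOP)
        stopped = wait_for_state(pid, lambda s: s == "T")
        os.kill(pid, signal.SIGCONT)   # same effect as: kill -SIGCONT pid
        resumed = wait_for_state(pid, lambda s: s != "T")
        return stopped, resumed
    finally:
        os.kill(pid, signal.SIGKILL)   # clean up the demo child
        os.waitpid(pid, 0)


if __name__ == "__main__":
    print(demo_stop_and_continue())
```

The first value should be T; the second is whatever state the child returns to once it runs again, typically S for a sleeping process.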

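If the parent looks normal, use strace -p pid to trace what it is doing and check for things like a deadlock.

If it is an orphan (its parent exited first), it will normally be adopted by the init process. If things still go wrong in that case, check whether init itself is in trouble, using the same methods as above.

Note: on newer versions, the init process is no longer affected by stop signals sent with kill, and it cannot be straced either.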