[v3.13][v3.14][Regression] kthread:makekthread_create()killable

Wed Mar 19 11:49:25 UTC 2014

Oleg Nesterov wrote:
> > > If we need the urgent hack to fix the regression, then I suggest to change
> > > scsi_host_alloc() temporary until mptsas (or whatever) is fixed.
> >
> > Device initialization taking longer than 30 seconds is possible and is not a
> > hang up. It is systemd which needs to be fixed.
> 
> Perhaps systemd needs the fix too, I do not know. But this is irrelevant,
> I think. Or at least this should be discussed separately.

I confirmed that this problem goes away if systemd-udevd supports longer
timeout.

> 
> kthread_run() can fail anyway, mptsas_probe() should not crash the kernel.

Right. But mptsas_probe() triggering an OOPS is irrelevant to kthread_run()
( comment #27 ).

> 
> And btw, it is not clear to me if in this case device initialization really
> needs more than 30 seconds... My understanding is probably wrong, so please
> correct me. But it seems that before your "make kthread_create() killable"
> 
> 	- probe hangs
> 
> 	- SIGKILL wakes it up
> 
> 	- so I assume that the probe was interrupted and didn't finish
> 	  correctly ???
> 
> 	- initialization continues, does scsi_host_alloc(), etc, and
> 	  everything works fine even if probe was interrupted?
> 

I confirmed that device initialization really took more than 30 seconds
( comments #51 and #52 ).

> So perhaps that probe should not hang and this should be fixed too ?
> Do you know where exactly it hangs? And where it is woken up by SIGKILL ?
> Or I totally misunderstood ?

The probe did not hang. SIGKILL affected only wait_for_completion_killable()
in kthread_create_on_node() called by mptsas_probe() via scsi_host_alloc().
Thus, the probe was interrupted because kthread_run() returned an error.

> > I think we need a bit different version, in order to take TIF_MEMDIE flag into
> > account at the caller of kthread_create(), for the purpose of commit 786235ee
> > is "try to die as soon as possible if chosen by the OOM killer".
> >
> > 	for (;;) {
> > 		shost->ehandler = kthread_run(scsi_error_handler, shost,
> > 					      "scsi_eh_%d", shost->host_no);
> > 		if (PTR_ERR(shost->ehandler) != -EINTR ||
> > 		    test_thread_flag(TIF_MEMDIE))
> 
> Well, personally I do not care about TIF_MEMDIE/oom at all. We need the
> temporary hack (unless we have the "right" fix right now) which should be
> reverted later.

I do seriously care about TIF_MEMDIE/oom. Last week I respond to a trouble
which hit "kernel: request_module() OOM local DoS" (RHBZ #853474) without
any malice.

> Not sure I understand... Yes, wait_for_completion_killable() can return
> immediately if TIF_SIGPENDING will be set again for any reason. Say, another
> signal. But the next iteration will clear TIF_SIGPENDING ?
> 
> >      As I think it is difficult to prove that kmalloc(GFP_KERNEL) never sets
> >      TIF_SIGPENDING flag
> 
> Ah, I see, you mean that kmalloc() can do this every time. No, this should
> not happen or we have another problem.

Then, what happens if somebody does

  while (1)
      kill(pid, SIGKILL);

where pid is the process calling kthread_run() from the "for (;;)" loop in
scsi_host_alloc()? Theoretically, it will form an infinite retry loop.
Clearing TIF_SIGPENDING does not guarantee that next
wait_for_completion_killable() does not return immediately.
Doing retry decision at scsi_host_alloc() will make things worse than
doing it at kthread_create_on_node().

> Anyway. I agree with any hack in scsi_host_alloc/etc, this is up to
> maintainers. I still think that your change uncovered the problems in
> drivers/message/fusion/, these problems should be fixed somehow.
> 
> Dear maintainers, we need your help.
> 

Right. We found that we can fix this problem by updating systemd-udevd to
support longer timeout ( comment #53 ). Joseph, would you consult systemd
maintainers?